From vickyk <vicky...@gmail.com>
Subject Re: Crawling to send data to Kafka.
Date Thu, 05 Jan 2017 06:01:32 GMT
Hey Guys,

Thanks for the useful details.

I have looked at the implementation and have a minor review comment from my
side: getJSONString() should be moved to a common utility location. I could not
find an appropriate existing utility class; maybe a CommonUtility class under
*apache-nutch-2.3.1/src/java/org/apache/nutch/util* could hold such shared
methods.
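
For illustration only, I mean something along these lines (the class name
CommonUtil and the field-map signature are my assumptions; the actual
getJSONString() in the Kafka plugin may look different):

package org.apache.nutch.util;

import java.util.Map;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

public class CommonUtil {

  private static final ObjectMapper MAPPER = new ObjectMapper();

  // Serialize a field map (e.g. url, title, content) to a JSON string,
  // so every plugin can reuse the same helper instead of keeping its own copy.
  public static String getJSONString(Map<String, Object> fields)
      throws JsonProcessingException {
    return MAPPER.writeValueAsString(fields);
  }
}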

NUTCH-2132 seems to emit events notifying Kafka about the fetching/parsing
status of the URLs; I did look at the code. I have been using version 2.3.1,
and the fix seems to have been done for 1.3, so I may have to port the JIRA if
I need to use this feature.

Our requirement is a little different: I would expect the parsed contents to
be sent to Kafka in a specific format that we can define in an Avro schema. I
have been using Gobblin for ETL and have defined the schema for Kafka
messaging; see
http://gobblin.readthedocs.io/en/latest/Getting-Started/#other-example-jobs
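
To make the Avro idea concrete, this is roughly what I have in mind on the
producer side (the ParsedPage record and its fields url/title/content/fetchTime
are only an example, not a final schema):

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class ParsedPageMessage {

  // Example schema for one parsed page; the real field list would be
  // whatever we agree on for the Kafka topic.
  static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"ParsedPage\",\"fields\":["
      + "{\"name\":\"url\",\"type\":\"string\"},"
      + "{\"name\":\"title\",\"type\":\"string\"},"
      + "{\"name\":\"content\",\"type\":\"string\"},"
      + "{\"name\":\"fetchTime\",\"type\":\"long\"}]}";

  // Encode one page as Avro binary, ready to be used as a Kafka message value.
  public static byte[] serialize(String url, String title, String content,
      long fetchTime) throws Exception {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
    GenericRecord rec = new GenericData.Record(schema);
    rec.put("url", url);
    rec.put("title", title);
    rec.put("content", content);
    rec.put("fetchTime", fetchTime);

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(schema).write(rec, encoder);
    encoder.flush();
    return out.toByteArray();
  }
}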
 

I can think of a couple of ways to handle it:
1) Consume the Kafka events that indicate fetching is done; the consumer would
then parse the URL, extract the content, and process it.
2) Modify the Kafka plugin so that the parsed contents are also published to
Kafka. This way we are not making additional calls to the site being crawled,
but the trade-off is higher network bandwidth consumption, since the messages
carrying the contents have to pass through the network. (See the sketch after
this list.)
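
Sketch of what I mean by option 2 (class name, topic handling and configuration
are placeholders; a real change would of course reuse the existing Kafka
plugin's configuration keys):

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ParsedContentPublisher {

  private final Producer<String, byte[]> producer;
  private final String topic;

  public ParsedContentPublisher(String brokers, String topic) {
    Properties props = new Properties();
    props.put("bootstrap.servers", brokers);
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.ByteArraySerializer");
    this.producer = new KafkaProducer<>(props);
    this.topic = topic;
  }

  // Key by URL so all versions of a page go to the same partition; the value
  // is the Avro-encoded parsed page from the sketch above.
  public void publish(String url, byte[] avroPayload) {
    producer.send(new ProducerRecord<>(topic, url, avroPayload));
  }

  public void close() {
    producer.close();
  }
}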

I would like to hear more from you.

Thanks,
Vicky



