From vickyk <vicky...@gmail.com>
Subject Seed URL ingestor behavior.
Date Tue, 03 Jan 2017 17:27:01 GMT
Hello Guys,

I have the following scenario.

urls.txt
http://localhost214:8080

vickey@vickey:~/development/crawler/apache-nutch-2.3.1/runtime/local/bin$ ./nutch inject seedlocal/ -crawlId 1
InjectorJob: starting at 2017-01-03 22:37:01
InjectorJob: Injecting urlDir: seedlocal
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
*InjectorJob: total number of urls injected after normalization and filtering: 2*
Injector: finished at 2017-01-03 22:37:06, elapsed: 00:00:04

hbase(main):021:0> scan "1_webpage"
ROW                                  COLUMN+CELL
 localhost213:http:8080              column=f:fi, timestamp=1483463225679, value=\x00'\x8D\x00
 localhost213:http:8080              column=f:ts, timestamp=1483463225679, value=\x00\x00\x01YeL`\x15
 localhost213:http:8080              column=mk:_injmrk_, timestamp=1483463225679, value=y
 localhost213:http:8080              column=mk:dist, timestamp=1483463225679, value=0
 localhost213:http:8080              column=mtdt:_csh_, timestamp=1483463225679, value=?\x80\x00\x00
 localhost213:http:8080              column=s:s, timestamp=1483463225679, value=?\x80\x00\x00
 localhost214:http:8080              column=f:fi, timestamp=1483463225827, value=\x00'\x8D\x00
 localhost214:http:8080              column=f:ts, timestamp=1483463225827, value=\x00\x00\x01YeL`\x15
 localhost214:http:8080              column=mk:_injmrk_, timestamp=1483463225827, value=y
 localhost214:http:8080              column=mk:dist, timestamp=1483463225827, value=0
 localhost214:http:8080              column=mtdt:_csh_, timestamp=1483463225827, value=?\x80\x00\x00
 localhost214:http:8080              column=s:s, timestamp=1483463225827, value=?\x80\x00\x00
2 row(s) in 0.0360 seconds


I then deleted the 1_webpage table:

hbase(main):022:0> disable "1_webpage"
0 row(s) in 1.6340 seconds

hbase(main):023:0> drop "1_webpage"
0 row(s) in 0.2340 seconds

hbase(main):024:0> scan "1_webpage"
ROW                                  COLUMN+CELL

*ERROR: Unknown table 1_webpage!*

Next, I injected the same seed URL again:

vickey@vickey:~/development/crawler/apache-nutch-2.3.1/runtime/local/bin$ ./nutch inject seedlocal/ -crawlId 1
InjectorJob: starting at 2017-01-03 22:47:19
InjectorJob: Injecting urlDir: seedlocal
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 2
Injector: finished at 2017-01-03 22:47:24, elapsed: 00:00:04

hbase(main):025:0> scan "1_webpage"
ROW                                  COLUMN+CELL
 localhost213:http:8080              column=f:fi, timestamp=1483463843514, value=\x00'\x8D\x00
 localhost213:http:8080              column=f:ts, timestamp=1483463843514, value=\x00\x00\x01YeU\xCDv
 localhost213:http:8080              column=mk:_injmrk_, timestamp=1483463843514, value=y
 localhost213:http:8080              column=mk:dist, timestamp=1483463843514, value=0
 localhost213:http:8080              column=mtdt:_csh_, timestamp=1483463843514, value=?\x80\x00\x00
 localhost213:http:8080              column=s:s, timestamp=1483463843514, value=?\x80\x00\x00
 localhost214:http:8080              column=f:fi, timestamp=1483463843666, value=\x00'\x8D\x00
 localhost214:http:8080              column=f:ts, timestamp=1483463843666, value=\x00\x00\x01YeU\xCDv
 localhost214:http:8080              column=mk:_injmrk_, timestamp=1483463843666, value=y
 localhost214:http:8080              column=mk:dist, timestamp=1483463843666, value=0
 localhost214:http:8080              column=mtdt:_csh_, timestamp=1483463843666, value=?\x80\x00\x00
 localhost214:http:8080              column=s:s, timestamp=1483463843666, value=?\x80\x00\x00
2 row(s) in 0.0460 seconds


Shouldn't deleting the 1_webpage table from HBase clear all the entries?
Please note that the seed URL entry is *http://localhost214:8080*. I was
expecting only its entry in 1_webpage, but the other one shows up as well.

Why do I see the *http://localhost213:8080* entry? My guess is that it is
coming from the file system.
I would like to ask here before I start digging into the code.
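
A quick way to test the file-system guess, as a rough sketch only: search
every file under the seed directory for the unexpected host. The directory
name seedlocal comes from the inject command above. The helper name
find_seed_files is hypothetical, and the idea that the injector reads every
file in that directory (not just urls.txt) is an assumption, not something
confirmed here.

```shell
#!/bin/sh
# Hypothetical helper: print the name of each file under a seed directory
# that mentions a given host. Assumption: the injector may read every file
# in the directory, so a leftover or backup file could add a second URL.
find_seed_files() {
    seed_dir="$1"   # e.g. seedlocal
    host="$2"       # e.g. localhost213
    # -r: recurse into the directory, -l: print file names only,
    # -F: treat the host as a fixed string rather than a regex
    grep -rlF -- "$host" "$seed_dir"
}
```

Usage would be: find_seed_files seedlocal localhost213
Any file name it prints other than urls.txt would be a candidate source of
the extra entry.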

Thanks,
Vicky