parsing - Java HttpConnection: preprocess URL content for jsoup or other parser -


I have a program that connects to a URL with Java's HttpConnection. The InputStream is parsed with jsoup. The problem is that this takes about 1 second per URL. Each web page has about 12,000 lines of code, but I only need a specific area (about 500 lines within a div), so I am wondering whether I can preprocess the InputStream and hand only that part of the code to jsoup for parsing. I have 100,000 pages to crawl and cannot handle that within 1 day on 1 server. I hope that some kind of preprocessing can lower the parsing time to something like 50-150 ms. I have already checked that jsoup parsing is the bottleneck, not the internet connection or the download.

I would appreciate any hints.

Yes, of course. Your solution is on the right track.

The problem is: where does the block of code start in the InputStream? That depends on the HTML of the document.

If the start is quite specific, you can read the stream and throw away everything that comes before the start of the block.
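As a rough sketch of that idea, the following reads the page line by line, discards everything before a start marker, and stops at an end marker. The class name, the method name and the marker strings are made up for illustration, and it assumes each marker sits on its own line of the page, which may not hold for every site.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

final class StreamingBlockReader {

    /**
     * Skips lines until one contains startMarker, then collects lines until
     * one contains endMarker (that line is not included). Returns null if the
     * start marker never appears. If the end marker never appears, everything
     * from the start marker to the end of the input is returned.
     * Assumption: the two markers are on different lines.
     */
    static String readBlock(Reader source, String startMarker, String endMarker) {
        StringBuilder block = new StringBuilder();
        boolean inside = false;
        try {
            BufferedReader reader = new BufferedReader(source);
            String line;
            while ((line = reader.readLine()) != null) {
                if (!inside) {
                    if (!line.contains(startMarker)) {
                        continue; // still before the block: throw the line away
                    }
                    inside = true;
                } else if (line.contains(endMarker)) {
                    break; // past the block: stop reading
                }
                block.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return inside ? block.toString() : null;
    }
}
```

With a real connection you would wrap the stream first, for example new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8), assuming the pages are UTF-8 encoded.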

You can read the input stream into a string and use indexOf or a regex pattern to find the block (regex is usually slower).

Then prepend <html><body> and append </body></html> to the extracted string, and you have something jsoup can parse.
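A minimal sketch of the indexOf and wrapping steps could look like this, assuming the page has already been read into a String. FragmentExtractor, extractAsDocument and the marker strings are hypothetical names; the end marker is assumed to be some text that reliably follows the block you want, such as the opening tag of the next div.

```java
final class FragmentExtractor {

    /**
     * Cuts out the text from startMarker (inclusive) up to endMarker
     * (exclusive) and wraps it in a minimal HTML skeleton.
     * Returns null if either marker is missing.
     */
    static String extractAsDocument(String html, String startMarker, String endMarker) {
        int start = html.indexOf(startMarker);
        if (start < 0) {
            return null; // block not present in this page
        }
        // Search for the end marker only after the start marker.
        int end = html.indexOf(endMarker, start + startMarker.length());
        if (end < 0) {
            return null; // page layout differs from what we assumed
        }
        String fragment = html.substring(start, end);
        return "<html><body>" + fragment + "</body></html>";
    }
}
```

The returned string can then be handed to Jsoup.parse(String). jsoup also offers Jsoup.parseBodyFragment(String), which wraps a fragment in a body element for you, so the manual prepend and append may not be needed in that case.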

