parsing - Java HttpConnection: preprocess URL content for jsoup or another parser
I have a program that connects to a URL with Java's HttpConnection. The InputStream is parsed by jsoup. The problem is that this takes about 1 second per URL. Each web page has about 12,000 lines of code, but I only need a specific area (about 500 lines within a div), so I am wondering whether I can preprocess the InputStream and hand only that part of the code to jsoup for parsing. I have 100,000 pages to crawl and cannot handle that within 1 day on 1 server. I hope some kind of preprocessing can lower the parsing time to something like 50-150 ms. I have already checked that jsoup parsing is the bottleneck, not the internet connection or the download.
I would appreciate any hints.
Yes, of course, your solution is on the right track.
The problem is: where does the block of code start in the InputStream? That depends on the HTML of the document.
If the start is quite specific, you can read the stream and throw away the bytes until the start of the block is matched.
You can read the input stream into a string and find the block with indexOf or with a regexp Pattern (regex is slower).
Then prepend <html><body> and append </body></html> to the extracted string, and you have something jsoup can parse.
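The steps above could look roughly like the sketch below. It is only an illustration under assumptions: the class name BlockExtractor and the two marker strings are hypothetical and must be replaced with whatever uniquely delimits the block in your pages, and the pages are assumed to be UTF-8. It uses only the standard library; the jsoup call is left as a comment.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class BlockExtractor {

    // Hypothetical markers: use strings that occur exactly once in your pages.
    static final String START_MARKER = "<div id=\"content\">";
    static final String END_MARKER = "<!-- end content -->";

    // Reads the whole stream into a String (assumes UTF-8 content).
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buf = new char[8192];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Cuts out everything from START_MARKER up to (not including) END_MARKER
    // and wraps it in a minimal document. Returns null if a marker is missing.
    static String extractBlock(String html) {
        int start = html.indexOf(START_MARKER);
        if (start < 0) {
            return null;
        }
        int end = html.indexOf(END_MARKER, start);
        if (end < 0) {
            return null;
        }
        return "<html><body>" + html.substring(start, end) + "</body></html>";
    }

    // Usage, given the InputStream from the connection:
    //   String fragment = BlockExtractor.extractBlock(BlockExtractor.readAll(in));
    //   if (fragment != null) { /* Document doc = Jsoup.parse(fragment); */ }
}
```

An end marker that follows the block is used here instead of searching for the closing </div>, because nested divs inside the block would make the first </div> the wrong one. If a page has no such marker, fall back to parsing the full page.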