Heritrix – Home Page
Posted by shannonclark on January 7, 2004
Heritrix is the open source crawler used by the Internet Archive. It looks likely to be very useful. I'm going to investigate further, but I can already see many uses for a good, well-written (and well-behaved) web page crawler/archiver, especially as a tool to help with my other AI research. I didn't really want to write a crawler myself, but I do need a large archive of websites and pages for much of what that research leads to.