Age  Commit message  Author
8 days  feat: save sitemap to a file  demo
8 days  feat: check for missing https://  demo
    This isn't as easy as modifying the parsed URL after the fact. This Stack Overflow post has some hints: https://stackoverflow.com/q/46719948/4570292
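
A minimal sketch of one way to handle a missing scheme, not necessarily the approach this commit or the linked post takes. `url.Parse` treats a schemeless string such as `example.com/docs` as a relative path, so `Host` stays empty; the sketch therefore adds the scheme to the raw string before parsing. The helper name `withScheme` and the choice to test for `://` are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// withScheme is a hypothetical helper: it prepends "https://" when the raw
// input has no "://" separator, and only then hands the string to url.Parse.
func withScheme(raw string) (*url.URL, error) {
	raw = strings.TrimSpace(raw)
	if !strings.Contains(raw, "://") {
		raw = "https://" + raw
	}
	return url.Parse(raw)
}

func main() {
	for _, in := range []string{"example.com/docs", "http://example.com"} {
		u, err := withScheme(in)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		// Print the pieces that matter for a crawler's start URL.
		fmt.Println(u.Scheme, u.Host, u.Path)
	}
}
```

An explicit scheme such as `http://` is left untouched, so the helper only fills in the default.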
8 days  feat: add header to xml output  demo
8 days  wip: generate rough draft of sitemap  demo
8 days  docs: expand findURLs godoc  demo
8 days  docs: add comment explaining purpose of log.Lshortfile  demo
8 days  refactor: move html document creation to getBatch  demo
    Also, if there are errors, I log them and simply return a nil slice.
8 days  chore: include shortfile printout in log invocations  demo
9 days  fix: make select statement block unless communication takes place  demo
9 days  feat: configure maxDepth from the command line  demo
    Similar to maxURLs, a maxDepth of zero means no limit.
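
A small sketch of the "zero means no limit" convention. The flag name `-depth`, the helper `withinDepth`, and the decision to treat the limit as inclusive are assumptions; the log does not show how the real flag is declared.

```go
package main

import (
	"flag"
	"fmt"
)

// withinDepth reports whether a URL found at the given depth may be crawled.
// A maxDepth of zero disables the limit; otherwise the limit is inclusive
// (an assumption made for this sketch).
func withinDepth(depth, maxDepth int) bool {
	return maxDepth == 0 || depth <= maxDepth
}

func main() {
	// A dedicated FlagSet with fixed arguments keeps the example reproducible.
	fs := flag.NewFlagSet("crawler", flag.ContinueOnError)
	maxDepth := fs.Int("depth", 0, "maximum link depth to follow (0 means no limit)")
	if err := fs.Parse([]string{"-depth", "2"}); err != nil {
		fmt.Println("error:", err)
		return
	}
	for depth := 0; depth <= 3; depth++ {
		fmt.Println(depth, withinDepth(depth, *maxDepth))
	}
}
```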
9 days  wip: prototype a max-depth limitation  demo
    It's just bolted on with a constant right now though.
9 days  feat: update the classic crawler to track depth via packets  demo
9 days  refactor: move packet definitions to their own file  demo
    I also decided to make the packet datatype package private.
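
A sketch of how depth could travel alongside each URL in a package-private packet type. The fields of `packet`, the function `crawlDepths`, and the in-memory link graph are all hypothetical stand-ins; the real crawler fetches pages over HTTP and its packet layout is not shown in this log.

```go
package main

import "fmt"

// packet is lowercase, so it is visible only inside its own package.
type packet struct {
	url   string
	depth int
}

// crawlDepths walks a link graph breadth-first and records the depth at
// which each URL is first seen. links stands in for real page fetching.
func crawlDepths(start string, links map[string][]string) map[string]int {
	depths := map[string]int{start: 0}
	queue := []packet{{url: start, depth: 0}}
	for len(queue) > 0 {
		p := queue[0]
		queue = queue[1:]
		for _, next := range links[p.url] {
			if _, seen := depths[next]; seen {
				continue // already discovered at an equal or shallower depth
			}
			depths[next] = p.depth + 1
			queue = append(queue, packet{url: next, depth: p.depth + 1})
		}
	}
	return depths
}

func main() {
	links := map[string][]string{
		"a": {"b", "c"},
		"b": {"d"},
		"c": {"d", "a"},
	}
	fmt.Println(crawlDepths("a", links))
}
```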
9 days  refactor: move "packet conversion" into a separate function  demo
10 days  docs: add extensive comments  demo
10 days  feat: measure the depth where each URL is found  demo
10 days  feat: add some prints to prove we need to select on Done()  demo
10 days  refactor: eliminate redundant select statement  demo
10 days  fix: make sure all workers terminate by the end  demo
10 days  feat: add early termination condition based on maxURLs  demo
10 days  feat: add break condition from worklist loop  demo
10 days  feat: add the worker-pool-based crawler from TGPL  demo
10 days  docs: add an "Awesome Go" section to the README  demo
10 days  docs: save websites I usually use with this crawler  demo
    This might expand into a whole journal on what sites I've tried the crawler with, and what the results were.
10 days  fix: release semaphore at the proper time  demo
10 days  feat: implement maxConcurrency using a buffered channel 'sema'  demo
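
A sketch of a counting semaphore built from a buffered channel, in the spirit of the `sema` channel named above. The functions `runLimited` and `peakConcurrency` are hypothetical, and the sleep stands in for an HTTP request. Acquiring inside the goroutine and releasing with `defer` is one way to make sure the token is returned only after the work finishes; whether this matches the fix in the following commit is not known from the log.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runLimited starts n goroutines but lets at most limit of them run work
// at the same time. The buffered channel acts as a counting semaphore.
func runLimited(n, limit int, work func(i int)) {
	sema := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			sema <- struct{}{}        // acquire a token
			defer func() { <-sema }() // release it once work returns
			work(i)
		}(i)
	}
	wg.Wait()
}

// peakConcurrency measures the highest number of simultaneous workers.
func peakConcurrency(n, limit int) int {
	var cur, peak atomic.Int32
	runLimited(n, limit, func(int) {
		c := cur.Add(1)
		for {
			p := peak.Load()
			if c <= p || peak.CompareAndSwap(p, c) {
				break
			}
		}
		time.Sleep(time.Millisecond) // placeholder for a GET request
		cur.Add(-1)
	})
	return int(peak.Load())
}

func main() {
	fmt.Println("peak workers:", peakConcurrency(20, 3))
}
```

The measured peak never exceeds the channel's capacity, because a worker can only increment the counter while holding a token.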
10 days  feat: add cancellation feature  demo
10 days  feat: hit 'em with the classic web crawler  demo
10 days  refactor: make deduplication part of main goroutine  demo
10 days  feat: restore original "print 45 and hang" behavior  demo
10 days  feat: change all channel payloads to pointer types  demo
10 days  feat: reveal bug in the channel linkage topology  demo
10 days  feat: add some code to cancel  demo
    However, currently this never gets reached.
10 days  refactor: use SplitSeq instead of Split  demo
10 days  feat: add goroutine-leak profiling  demo
10 days  feat: design worker-pool webcrawler  demo
10 days  docs: remove comment  demo
13 days  refactor: move main logic into separate function  demo
13 days  feat: avoid sending empty URL slices to the worklist  demo
13 days  feat: add diagnostics to prove code is buggy  demo
13 days  docs: add package godoc  demo
14 days  docs: expound help string for -max argument  demo
14 days  fix: check *maxURLs > 0 case  demo
14 days  feat: add maxURLs CLI flag  demo
    I've also added some more input sanitization.
14 days  feat: implement cancellation  demo
    I've also added 1., 2., etc. to the printed list of URLs.
14 days  feat: add semaphore to throttle concurrent GET requests  demo
14 days  refactor: remove intermediate variable  demo
14 days  docs: add comment for clarity  demo
14 days  feat: implement simple BFS webcrawler  demo
14 days  feat: create empty main.go  demo