AgeCommit message (Collapse)Author
8 daysfeat: specify whether URL sources are both missing or both presentdemo
8 daysfeat: implement shortcode featuredemo
8 daysrefactor: move initial URL parsing into function 'convertToURL'demo
8 dayschore: add urls.csv shortcodes filedemo
This file contains convenient shortcode mappings to eliminate the need to remember URLs I scrape frequently for testing purposes.
8 daysdocs: add some commentsdemo
8 daysfeat: add a user agent header!demo
8 daysfeat: add "total entries" as part of XML commentdemo
This is useful mostly in the context of maxURLs=∞, but perhaps it could also help catch some error cases.
8 daysfeat: prettify stats when figure was passed in as 0demo
The code understands 0 as "no limit", but I want to convey the no-limit concept to readers of the file who don't have a notion of how the program works. So I convert 0 to ∞ in the string output here.
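The conversion described above can be sketched roughly as follows. The helper name `formatLimit` and the exact wording of the XML comment are assumptions made for illustration, not the names used in the repository.

```go
package main

import (
	"fmt"
	"strconv"
)

// formatLimit renders a limit for human readers: 0 means "no limit",
// so it is shown as ∞; any other value is printed as a decimal number.
func formatLimit(n int) string {
	if n == 0 {
		return "∞"
	}
	return strconv.Itoa(n)
}

func main() {
	// Hypothetical comment layout for the top of the sitemap file.
	fmt.Printf("<!-- maxDepth=%s maxURLs=%s -->\n", formatLimit(3), formatLimit(0))
}
```

Only the string output changes; the program itself would keep treating 0 as the "no limit" sentinel.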
8 daysfeat: place comment before URL listingdemo
There could be a lot of URLs, so this should be more user-friendly.
8 daysfeat: add comment logging maxDepth and maxURLs inside xml outputdemo
8 dayschore: ignore xml outputdemo
8 daysfeat: save sitemap to a filedemo
8 daysfeat: check for missing https://demo
This isn't as easy as modifying the parsed URL after the fact. This Stack Overflow post has some hints: https://stackoverflow.com/q/46719948/4570292
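One reason patching the parsed URL afterwards is awkward: given an input such as `example.com/docs`, `url.Parse` reports an empty scheme and host and places the whole string in the path. A minimal sketch of one possible workaround is to add the scheme before parsing. The helper name `withScheme` and the `"://"` test are assumptions, not the approach this commit necessarily takes.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// withScheme prepends "https://" when raw contains no "://" separator,
// then parses the result. This is a simple heuristic, not a full validator.
func withScheme(raw string) (*url.URL, error) {
	if !strings.Contains(raw, "://") {
		raw = "https://" + raw
	}
	return url.Parse(raw)
}

func main() {
	for _, raw := range []string{"example.com/docs", "http://example.com"} {
		u, err := withScheme(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(u.Scheme, u.Host, u.Path)
	}
}
```

An input that already carries a scheme, including plain `http://`, passes through unchanged.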
8 daysfeat: add header to xml outputdemo
8 dayswip: generate rough draft of sitemapdemo
8 daysdocs: expand findURLs godocdemo
8 daysdocs: add comment explaining purpose of log.Lshortfiledemo
8 daysrefactor: move html document creation to getBatchdemo
Also, if there are errors, I log them and simply return a nil slice.
8 dayschore: include shortfile printout in log invocationsdemo
9 daysfix: make select statement block unless communication takes placedemo
9 daysfeat: configure maxDepth from the command linedemo
Similar to maxURLs, a maxDepth of zero means no limit.
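A rough sketch of the "zero means no limit" convention combined with a command-line flag. The flag name `-maxDepth`, the helper `exceeds`, and the choice to treat a depth equal to the limit as still allowed are assumptions for illustration.

```go
package main

import (
	"flag"
	"fmt"
)

// exceeds reports whether depth is past limit.
// A limit of 0 is treated as "no limit", so nothing ever exceeds it.
func exceeds(depth, limit int) bool {
	return limit != 0 && depth > limit
}

func main() {
	// A private FlagSet with fixed arguments keeps the sketch deterministic;
	// a real program would more likely parse os.Args[1:].
	fs := flag.NewFlagSet("crawler", flag.ContinueOnError)
	maxDepth := fs.Int("maxDepth", 0, "maximum crawl depth (0 = no limit)")
	if err := fs.Parse([]string{"-maxDepth", "2"}); err != nil {
		fmt.Println("error:", err)
		return
	}
	for depth := 0; depth <= 3; depth++ {
		fmt.Printf("depth %d skipped: %v\n", depth, exceeds(depth, *maxDepth))
	}
}
```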
9 dayswip: prototype a max-depth limitationdemo
It's just bolted on with a constant right now though.
9 daysfeat: update the classic crawler to track depth via packetsdemo
9 daysrefactor: move packet definitions to their own filedemo
I also decided to make the packet datatype package private.
9 daysrefactor: move "packet conversion" into a separate functiondemo
10 daysdocs: add extensive commentsdemo
10 daysfeat: measure the depth where each URL is founddemo
10 daysfeat: add some prints to prove we need to select on Done()demo
10 daysrefactor: eliminate redundant select statementdemo
10 daysfix: make sure all workers terminate by the enddemo
10 daysfeat: add early termination condition based on maxURLsdemo
10 daysfeat: add break condition from worklist loopdemo
10 daysfeat: add the worker-pool-based crawler from TGPLdemo
10 daysdocs: add an "Awesome Go" section to the READMEdemo
10 daysdocs: save websites I usually use with this crawlerdemo
This might expand into a whole journal on what sites I've tried the crawler with, and what the results were.
10 daysfix: release semaphore at the proper timedemo
10 daysfeat: implement maxConcurrency using a buffered channel 'sema'demo
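A buffered channel used as a counting semaphore can be sketched as below. Only the name `sema` and the idea of a maxConcurrency bound come from the commit message; the helper `runLimited`, its signature, and the peak-tracking code are hypothetical additions that make the limit observable.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runLimited calls work(i) for i in [0, n) in separate goroutines, with at
// most maxConcurrency calls in flight at once. It returns the highest number
// of simultaneous calls it observed.
func runLimited(n, maxConcurrency int, work func(int)) int32 {
	sema := make(chan struct{}, maxConcurrency) // counting semaphore
	var wg sync.WaitGroup
	var running, peak int32

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			sema <- struct{}{}        // acquire a token; blocks when the buffer is full
			defer func() { <-sema }() // release the token when this call finishes

			cur := atomic.AddInt32(&running, 1)
			for { // record the peak with a compare-and-swap loop
				old := atomic.LoadInt32(&peak)
				if cur <= old || atomic.CompareAndSwapInt32(&peak, old, cur) {
					break
				}
			}
			work(i)
			atomic.AddInt32(&running, -1)
		}(i)
	}
	wg.Wait()
	return atomic.LoadInt32(&peak)
}

func main() {
	peak := runLimited(20, 3, func(int) { time.Sleep(time.Millisecond) })
	fmt.Println("limit respected:", peak <= 3)
}
```

The token is released in a deferred call so that it is returned even if `work` panics, and the counter is decremented before the release so the observed peak can never exceed the buffer size.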
10 daysfeat: add cancellation featuredemo
10 daysfeat: hit 'em with the classic web crawlerdemo
10 daysrefactor: make deduplication part of main goroutinedemo
10 daysfeat: restore original "print 45 and hang" behaviordemo
10 daysfeat: change all channel payloads to pointer typesdemo
10 daysfeat: reveal bug in the channel linkage topologydemo
10 daysfeat: add some code to canceldemo
However, currently this never gets reached.
10 daysrefactor: use SplitSeq instead of Splitdemo
10 daysfeat: add goroutine-leak profilingdemo
10 daysfeat: design worker-pool webcrawlerdemo
10 daysdocs: remove commentdemo
12 daysrefactor: move main logic into separate functiondemo
13 daysfeat: avoid sending empty URL slices to the worklistdemo