urls - Web crawler in Go.

Age	Commit message	Author
2026-05-28	feat: specify whether URL sources are both missing or both present	demo

2026-05-28	feat: implement shortcode feature	demo

2026-05-28	refactor: move initial URL parsing into function 'convertToURL'	demo

2026-05-28	chore: add urls.csv shortcodes file	demo
	This file contains convenient shortcode mappings to eliminate the need to remember URLs I scrape frequently for testing purposes.
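A minimal sketch of how a shortcode lookup against urls.csv might work. The two-column layout (shortcode, URL), the sample rows, and the helper names parseShortcodes and resolve are assumptions made for illustration; the log does not show the real file format or code.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parseShortcodes reads CSV text with two columns per row (shortcode, URL)
// and returns a lookup table. The column layout is an assumption.
func parseShortcodes(data string) (map[string]string, error) {
	r := csv.NewReader(strings.NewReader(data))
	r.FieldsPerRecord = 2 // reject rows that do not have exactly two fields
	rows, err := r.ReadAll()
	if err != nil {
		return nil, err
	}
	codes := make(map[string]string, len(rows))
	for _, row := range rows {
		codes[row[0]] = row[1]
	}
	return codes, nil
}

// resolve returns the URL mapped to arg, or arg itself when it is not a
// known shortcode.
func resolve(arg string, codes map[string]string) string {
	if u, ok := codes[arg]; ok {
		return u
	}
	return arg
}

func main() {
	// Hypothetical file contents.
	codes, err := parseShortcodes("ex,https://example.com\ngo,https://go.dev\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(resolve("go", codes))
	fmt.Println(resolve("https://example.org", codes))
}
```

Falling back to the argument itself lets the same command-line position accept either a shortcode or a full URL.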
2026-05-28	docs: add some comments	demo

2026-05-28	feat: add a user agent header!	demo

2026-05-28	feat: add "total entries" as part of XML comment	demo
	This is useful mostly in the context of maxURLs=∞, but perhaps it could also help catch some error cases.
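One way the sitemap output with a "total entries" comment could be produced, sketched with encoding/xml. The struct names, the renderSitemap helper, and the exact comment wording are assumptions; only the ideas of an XML header, a comment placed before the URL listing, and an entry count come from the log.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// urlEntry and urlSet model a bare-bones sitemap. Names are hypothetical.
type urlEntry struct {
	Loc string `xml:"loc"`
}

type urlSet struct {
	XMLName xml.Name   `xml:"urlset"`
	Xmlns   string     `xml:"xmlns,attr"`
	URLs    []urlEntry `xml:"url"`
}

// renderSitemap returns the XML header, a comment recording the number of
// entries, and then the URL listing.
func renderSitemap(locs []string) (string, error) {
	set := urlSet{Xmlns: "http://www.sitemaps.org/schemas/sitemap/0.9"}
	for _, loc := range locs {
		set.URLs = append(set.URLs, urlEntry{Loc: loc})
	}
	body, err := xml.MarshalIndent(set, "", "  ")
	if err != nil {
		return "", err
	}
	comment := fmt.Sprintf("<!-- total entries: %d -->\n", len(locs))
	return xml.Header + comment + string(body) + "\n", nil
}

func main() {
	out, err := renderSitemap([]string{"https://example.com/", "https://example.com/about"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```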
2026-05-28	feat: prettify stats when figure was passed in as 0	demo
	The code understands 0 as "no limit", but I want to convey the no-limit concept to readers of the file who don't have a notion of how the program works. So I convert 0 to ∞ in the string output here.
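The conversion described above can be sketched as follows. The function name formatLimit and the comment layout printed in main are assumptions, not the repository's actual code.

```go
package main

import (
	"fmt"
	"strconv"
)

// formatLimit renders a limit for readers of the output file. The crawler
// treats 0 as "no limit", so 0 is printed as the infinity sign.
// The function name is hypothetical.
func formatLimit(n int) string {
	if n == 0 {
		return "∞"
	}
	return strconv.Itoa(n)
}

func main() {
	// Assumed comment layout, for illustration only.
	fmt.Printf("<!-- maxDepth=%s maxURLs=%s -->\n", formatLimit(3), formatLimit(0))
}
```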
2026-05-28	feat: place comment before URL listing	demo
	There could be a lot of URLs, so this should be more user-friendly.
2026-05-28	feat: add comment logging maxDepth and maxURLs inside xml output	demo

2026-05-28	chore: ignore xml output	demo

2026-05-28	feat: save sitemap to a file	demo

2026-05-28	feat: check for missing https://	demo
	This isn't as easy as modifying the parsed URL after the fact. This Stack Overflow post has some hints: https://stackoverflow.com/q/46719948/4570292
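A short illustration of why the scheme has to be added before parsing: net/url treats a scheme-less string such as "example.com/docs" as a relative path, so Host comes back empty and there is nothing sensible to patch afterwards. The helper name ensureScheme and the "contains ://" test are assumptions; the commit may handle this differently.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// ensureScheme prepends "https://" when raw carries no scheme separator.
// The name and the detection rule are hypothetical.
func ensureScheme(raw string) string {
	if strings.Contains(raw, "://") {
		return raw
	}
	return "https://" + raw
}

func main() {
	raw := "example.com/docs"

	// Parsed as-is, the whole string ends up in Path and Host is empty.
	before, err := url.Parse(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("before: host=%q path=%q\n", before.Host, before.Path)

	// With the scheme added first, the host is recognized.
	after, err := url.Parse(ensureScheme(raw))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("after:  host=%q path=%q\n", after.Host, after.Path)
}
```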
2026-05-28	feat: add header to xml output	demo

2026-05-28	wip: generate rough draft of sitemap	demo

2026-05-28	docs: expand findURLs godoc	demo

2026-05-28	docs: add comment explaining purpose of log.Lshortfile	demo

2026-05-28	refactor: move html document creation to getBatch	demo
	Also, if there are errors, I log them and simply return a nil slice.
2026-05-28	chore: include shortfile printout in log invocations	demo

2026-05-27	fix: make select statement block unless communication takes place	demo

2026-05-27	feat: configure maxDepth from the command line	demo
	Similar to maxURLs, a maxDepth of zero means no limit.
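A rough sketch of reading both limits from the command line with zero meaning "no limit". The flag names mirror maxDepth and maxURLs from the log, but the helper names parseLimits and withinLimit, the help strings, and the "value <= limit" comparison are assumptions.

```go
package main

import (
	"flag"
	"fmt"
)

// parseLimits reads the two limits from args. Both default to 0, which the
// crawler interprets as "no limit". The function name is hypothetical.
func parseLimits(args []string) (maxDepth, maxURLs int, err error) {
	fs := flag.NewFlagSet("urls", flag.ContinueOnError)
	fs.IntVar(&maxDepth, "maxDepth", 0, "maximum link depth to follow (0 = no limit)")
	fs.IntVar(&maxURLs, "maxURLs", 0, "maximum number of URLs to collect (0 = no limit)")
	err = fs.Parse(args)
	return maxDepth, maxURLs, err
}

// withinLimit reports whether value is allowed under limit, where a limit
// of 0 disables the check entirely.
func withinLimit(value, limit int) bool {
	return limit == 0 || value <= limit
}

func main() {
	maxDepth, maxURLs, err := parseLimits([]string{"-maxDepth", "2"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("depth 3 allowed:", withinLimit(3, maxDepth))
	fmt.Println("500 URLs allowed:", withinLimit(500, maxURLs))
}
```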
2026-05-27	wip: prototype a max-depth limitation	demo
	It's just bolted on with a constant right now though.
2026-05-27	feat: update the classic crawler to track depth via packets	demo

2026-05-27	refactor: move packet definitions to their own file	demo
	I also decided to make the packet datatype package private.
2026-05-27	refactor: move "packet conversion" into a separate function	demo

2026-05-26	docs: add extensive comments	demo

2026-05-26	feat: measure the depth where each URL is found	demo

2026-05-26	feat: add some prints to prove we need to select on Done()	demo

2026-05-26	refactor: eliminate redundant select statement	demo

2026-05-26	fix: make sure all workers terminate by the end	demo

2026-05-26	feat: add early termination condition based on maxURLs	demo

2026-05-26	feat: add break condition from worklist loop	demo

2026-05-26	feat: add the worker-pool-based crawler from TGPL	demo

2026-05-26	docs: add an "Awesome Go" section to the README	demo

2026-05-26	docs: save websites I usually use with this crawler	demo
	This might expand into a whole journal on what sites I've tried the crawler with, and what the results were.
2026-05-26	fix: release semaphore at the proper time	demo

2026-05-26	feat: implement maxConcurrency using a buffered channel 'sema'	demo
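A buffered channel used as a counting semaphore might look like the sketch below. The channel name sema and the maxConcurrency parameter come from the commit title; the runLimited helper and the peak bookkeeping are assumptions added to make the limit observable.

```go
package main

import (
	"fmt"
	"sync"
)

// runLimited calls work(i) for i in [0, n) with at most maxConcurrency calls
// in flight at once. It returns the highest number of calls that were
// observed running at the same time. The function name is hypothetical.
func runLimited(n, maxConcurrency int, work func(int)) int {
	sema := make(chan struct{}, maxConcurrency) // capacity = concurrency limit
	var (
		wg            sync.WaitGroup
		mu            sync.Mutex
		running, peak int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			sema <- struct{}{}        // acquire: blocks while the channel is full
			defer func() { <-sema }() // release only after the work has finished

			mu.Lock()
			running++
			if running > peak {
				peak = running
			}
			mu.Unlock()

			work(i)

			mu.Lock()
			running--
			mu.Unlock()
		}(i)
	}
	wg.Wait()
	return peak
}

func main() {
	peak := runLimited(20, 3, func(int) {})
	fmt.Println("peak within limit:", peak <= 3)
}
```

Releasing the token in a deferred call keeps the slot held for the whole duration of the work, including early returns.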

2026-05-26	feat: add cancellation feature	demo

2026-05-26	feat: hit 'em with the classic web crawler	demo

2026-05-26	refactor: make deduplication part of main goroutine	demo

2026-05-26	feat: restore original "print 45 and hang" behavior	demo

2026-05-26	feat: change all channel payloads to pointer types	demo

2026-05-26	feat: reveal bug in the channel linkage topology	demo

2026-05-26	feat: add some code to cancel	demo
	However, currently this never gets reached.
2026-05-26	refactor: use SplitSeq instead of Split	demo

2026-05-26	feat: add goroutine-leak profiling	demo

2026-05-26	feat: design worker-pool webcrawler	demo

2026-05-26	docs: remove comment	demo

2026-05-24	refactor: move main logic into separate function	demo

2026-05-23	feat: avoid sending empty URL slices to the worklist	demo