# core

core is the fetch-and-read layer. [`fetch()`](https://vedicreader.github.io/fossick/cli.html#fetch) handles any URL from a simple static page to a JavaScript-rendered SPA; [`to_md()`](https://vedicreader.github.io/fossick/core.html#to_md) strips HTML to clean markdown. The module also covers pagination across JSON APIs, reading arXiv papers and YouTube transcripts, and cloning GitHub repos.

------------------------------------------------------------------------

### save_path

``` python
def save_path(
    path:NoneType=None
):
```

*Get cache path for `name` (e.g. ‘arxiv’ or ‘fetch’)*

------------------------------------------------------------------------

### http_post

``` python
def http_post(
    url, kw:VAR_KEYWORD
):
```

*Call self as a function.*

------------------------------------------------------------------------

### http_get

``` python
def http_get(
    url, kw:VAR_KEYWORD
):
```

*Call self as a function.*

------------------------------------------------------------------------

### get_pdf

``` python
def get_pdf(
    url:str
):
```

*Fetch PDF from URL and return as PdfDocument*

------------------------------------------------------------------------

### read_arxiv

``` python
def read_arxiv(
    url:str, # arxiv PDF URL, or arxiv abstract URL, or arxiv ID
    save_pdf:bool=True, # if True, saves the downloaded PDF to disk
    save_dir:str='.', # directory in which to save the PDF
    force:bool=False, # if True, forces re-download of PDF even if it exists on disk
):
```

*Get paper information from arxiv URL or ID, optionally saving PDF to disk*

------------------------------------------------------------------------

### read_gh_repo

``` python
def read_gh_repo(
    path_or_url:str, # GitHub URL, SSH address, or local path
    globs:tuple=None, # file glob patterns (default: README*, pyproject.toml, *.py)
    limit:int=None, # max files to return
    as_list:bool=False, # return list of Paths instead of {path: content} dict
):
```

*Read files from a
GitHub repo filtered by glob patterns*

------------------------------------------------------------------------

### read_gh_file

``` python
def read_gh_file(
    url:str, # GitHub blob URL of the file to read
):
```

*Read raw contents of a file from its GitHub URL*

``` python
read_arxiv('https://arxiv.org/abs/2306.14881')['summary'][:200]
```

    'Low-metallicity dwarf galaxies often show no or little CO emission, despite the intense star formation observed in local samples. Both simulations and resolved observations indicate that molecular gas'

``` python
read_gh_file('https://github.com/Karthik777/litesearch/blob/main/README.md')[:200]
```

    '# litesearch\n\n\n\n\n> **NB** Reading this on GitHub? The formatted\n> [documentation](https://Karthik777.github.io/litesearch/) is nicer.\n\nlitese'

``` python
list(read_gh_repo('https://github.com/vedicreader/gheasy'))
```

    ['/Users/71293/.cache/.fossick/git_clones/gheasy/README.md',
     '/Users/71293/.cache/.fossick/git_clones/gheasy/pyproject.toml',
     '/Users/71293/.cache/.fossick/git_clones/gheasy/gheasy/__init__.py',
     '/Users/71293/.cache/.fossick/git_clones/gheasy/gheasy/_modidx.py',
     '/Users/71293/.cache/.fossick/git_clones/gheasy/gheasy/core.py',
     '/Users/71293/.cache/.fossick/git_clones/gheasy/gheasy/workflow.py']

## Web Fetching

[`fetch()`](https://vedicreader.github.io/fossick/cli.html#fetch) returns a dict with `url`, `status`, `html`, `data` (parsed JSON when the response is JSON), and `xhr` (captured network calls when `capture_xhr=True`). [`to_md()`](https://vedicreader.github.io/fossick/core.html#to_md) produces clean markdown, optionally extracting just the element matched by a CSS selector.
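To make the Page shape concrete, here is a minimal standard-library sketch. `make_page` is a hypothetical helper, not part of fossick, and the rule it uses to decide when to fill `data` (a JSON content type) is an assumption; it only mirrors the five keys described in the paragraph above.

``` python
import json

def make_page(url, status, body, content_type=''):
    "Hypothetical helper (not fossick's code): build a dict with the five Page keys."
    data = None
    # Assumption: only attempt JSON parsing when the content type says JSON
    if 'json' in content_type.lower():
        try:
            data = json.loads(body)
        except json.JSONDecodeError:
            data = None  # malformed JSON: leave data empty, keep the raw body
    return {'url': url, 'status': status, 'html': body, 'data': data, 'xhr': []}

api_page = make_page('https://example.com/api', 200, '{"items": [1, 2, 3]}', 'application/json')
web_page = make_page('https://example.com/', 200, '<h1>Hello</h1>', 'text/html')

print(api_page['data'])  # parsed dict for the JSON body
print(web_page['data'])  # None for the HTML body
```

Keeping the raw body under `html` even for JSON responses means a caller can always fall back to the unparsed text, which is consistent with the description of `html` as the raw response body.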
------------------------------------------------------------------------

### to_md

``` python
def to_md(
    page_or_html, # Page dict (from fetch/crawl) or raw HTML string
    sel:str=None, # CSS selector to extract before conversion; returns '' if no match
    multi:bool=False, # Return all selector matches joined
    wrap_tag:str=None, # Wrap each multi-result in ...; only used when multi=True
    ignore_links:bool=True,
    rm_comments:bool=True,
    rm_details:bool=True
)->str:
```

*Convert a Page dict or HTML string to clean markdown*

------------------------------------------------------------------------

### html2md

``` python
def html2md(
    s:str, ignore_links:bool=True
):
```

*Convert `s` from HTML to markdown*

------------------------------------------------------------------------

### clean_md

``` python
def clean_md(
    text, rm_comments:bool=True, rm_details:bool=True
):
```

*Remove comments and `<details>` sections from `text`*

------------------------------------------------------------------------

### fetch

``` python
def fetch(
    url:str, # URL to fetch
    sel:str=None, # CSS selector to extract (None = full page)
    method:str='GET', # HTTP method; 'POST' sends payload as JSON body
    payload:dict=None, # POST body (JSON) or GET query params
    heavy:bool=False, # Full JS rendering via headless browser
    stealthy:bool=False, # Anti-bot stealth fetcher (Cloudflare etc.)
    capture_xhr:bool=False, # Intercept XHR/fetch calls; forces heavy=True
    cache:bool=False, # Cache successful responses to disk by URL+sel
    force:bool=False, # If True, forces re-fetch even if cached result exists
    kw:VAR_KEYWORD
)->dict: # Extra kwargs passed to scrapling (e.g. verify, headers)
```

*Fetch `url`, return Page dict {url, status, html, data, xhr} where html is raw response body*

------------------------------------------------------------------------

### crawl

``` python
def crawl(
    start_url:str, # URL to start from
    sel:str=None, # CSS selector to extract per page
    follow_sel:str='a[href]', # CSS selector for links to follow
    same_domain:bool=True, # Only follow links on same domain
    max_pages:int=10, # Max pages to visit
    delay:float=0, # Seconds to wait between requests (polite crawling)
    heavy:bool=False,
    stealthy:bool=False,
    kw:VAR_KEYWORD
)->list: # Extra kwargs passed to scrapling (e.g. verify, timeout)
```

*Crawl from `start_url`, following `follow_sel` links, return list of Page dicts*

------------------------------------------------------------------------

### fetch_all

``` python
def fetch_all(
    urls:list, # URLs to fetch
    sel:str=None, # CSS selector to extract per page (None = full page)
    concurrency:int=8, # Max parallel fetches
    heavy:bool=False,
    stealthy:bool=False,
    kw:VAR_KEYWORD
)->list: # Extra kwargs passed to fetch()
```

*Fetch a list of URLs in parallel; returns Page dicts in the same order as urls*

------------------------------------------------------------------------

### get_options

``` python
def get_options(
    page_or_html, # Page dict (from fetch) or raw HTML string
    sel:str, # CSS selector for the <select> element
):
```

*List the options of a `<select>` element; returns \[{‘value’: …, ‘text’: …}\]*

``` python
_sel_html = '''
<select id="kanda">
  <option value="1">Balakanda</option>
  <option value="2">Ayodhyakanda</option>
  <option value="3">Aranyakanda</option>
</select>
'''
opts = get_options(_sel_html, '#kanda')
assert opts == [{'value': '1', 'text': 'Balakanda'}, {'value': '2', 'text': 'Ayodhyakanda'}, {'value': '3', 'text': 'Aranyakanda'}]

# accepts Page dict
_page = {'url': 'x', 'status': 200, 'html': _sel_html, 'data': None, 'xhr': []}
assert get_options(_page, '#kanda') == opts

# no match → empty list
assert get_options(_sel_html, '#missing') == []

# fetch_all: parallel fetch, order preserved
_urls = ['https://httpbin.org/get', 'https://httpbin.org/ip']
_pages = fetch_all(_urls, verify=False)
assert len(_pages) == 2
assert _pages[0]['url'] == _urls[0]
assert _pages[1]['url'] == _urls[1]
assert all(p['status'] == 200 for p in _pages)

# live: Valmiki Ramayana, discover sargas 1–3 of Balakanda and Ayodhyakanda, fetch in parallel
_base = 'https://www.valmiki.iitk.ac.in/sloka'
_home = fetch(f'{_base}?field_kanda_tid=1&language=dv&field_sarga_value=1')
_kandas = [o for o in get_options(_home, '#edit-field-kanda-tid') if o['value']]  # drop placeholder
assert len(_kandas) >= 6, f"Expected ≥6 kandas, got {len(_kandas)}: {_kandas}"
assert any('BALA' in k['text'].upper() for k in _kandas)

for _k in _kandas[:2]:  # Balakanda, Ayodhyakanda
    _kp = fetch(f'{_base}?field_kanda_tid={_k["value"]}&language=dv&field_sarga_value=1')
    _sargas = [o for o in get_options(_kp, '#edit-field-sarga-value') if o['value']]
    assert len(_sargas) > 0, f"No sargas found for {_k['text']}"
    _urls = [f'{_base}?field_kanda_tid={_k["value"]}&language=dv&field_sarga_value={s["value"]}' for s in _sargas[:3]]
    _pages = fetch_all(_urls, sel='.view-content')
    assert len(_pages) == 3
    assert all(p['status'] == 200 for p in _pages)
    assert all(len(to_md(p)) > 50 for p in _pages), "Expected non-trivial markdown content"
    print(f"{_k['text']}: {len(_sargas)} sargas, first 3 fetched OK")
    print(f" sarga 1 preview: {to_md(_pages[0])[:120]!r}")
```

    [2026-06-03 08:26:24] INFO: Fetched (200) (referer: https://www.google.com/)
    [2026-06-03 08:26:24] INFO: Fetched (200) (referer: https://www.google.com/)
    [2026-06-03 08:26:26] INFO: Fetched (200) (referer: https://www.google.com/)
    [2026-06-03 08:26:26] INFO: Fetched (200) (referer: https://www.google.com/)
    [2026-06-03 08:26:27] INFO: Fetched (200) (referer: https://www.google.com/)
    [2026-06-03 08:26:28] INFO: Fetched (200) (referer: https://www.google.com/)
    [2026-06-03 08:26:28] INFO: Fetched (200) (referer: https://www.google.com/)

    BALAKANDA: 77 sargas, first 3 fetched OK
     sarga 1 preview: '[Saint Narada visits hermitage of Valmiki -- Valmiki queries about a single perfect individual bestowed with all good qu'

    [2026-06-03 08:26:29] INFO: Fetched (200) (referer: https://www.google.com/)
    [2026-06-03 08:26:30] INFO: Fetched (200) (referer: https://www.google.com/)
    [2026-06-03 08:26:30] INFO: Fetched (200) (referer: https://www.google.com/)
    [2026-06-03 08:26:30] INFO: Fetched (200) (referer: https://www.google.com/)

    AYODHYAKANDA: 119 sargas, first 3 fetched OK
     sarga 1 preview: "[Description of Rama's virtues Dasaratha contemplates to install Rama as heirapparent Invites kings and elders from town"

## API Discovery & Pagination
[`find_xhr()`](https://vedicreader.github.io/fossick/core.html#find_xhr) visits a page with a real browser, captures all XHR and fetch calls it makes, and returns those matching a URL pattern. This surfaces the undocumented JSON endpoints that JavaScript-heavy sites use to load their data. [`paginate_api()`](https://vedicreader.github.io/fossick/core.html#paginate_api) replays one of those captured requests across pages until results are exhausted.

------------------------------------------------------------------------

### find_xhr

``` python
def find_xhr(
    url:str, # URL to visit with browser
    pattern:str='*', # Glob or regex pattern to filter captured XHR URLs
    json_only:bool=True, # Return only JSON responses
    kw:VAR_KEYWORD
)->list: # Extra kwargs passed to fetch (verify, network_idle, etc.)
```

*Visit `url` with a headless browser, return \[{url, content_type, data}\] for each XHR/fetch call made*

------------------------------------------------------------------------

### compile_pattern

``` python
def compile_pattern(
    pattern
):
```

*Compile pattern as regex; if invalid (e.g.
bare glob like `*foo*`), convert via fnmatch first*

``` python
assert compile_pattern('.*products.*').search('https://api.example.com/products?q=1')
assert compile_pattern('*products*').search('https://api.example.com/products?q=1')
assert not compile_pattern('*products*').search('https://api.example.com/search')
assert compile_pattern('.*[Ss]earch.*').search('https://api.example.com/Search')
```

------------------------------------------------------------------------

### paginate_api

``` python
def paginate_api(
    url:str, # API endpoint URL
    payload:dict=None, # Request body (POST) or params (GET)
    page_field:str='pageNumber', # Payload key to increment for each page
    size_field:str='pageSize', # Payload key for page size (detects last page)
    results_field:str=None, # Response key with items list (auto-detect if None)
    method:str='POST', # HTTP method
    max_pages:int=10,
    page_size:int=24, # Page size to request (only used if not in payload)
    page_start:int=1, # Starting page number (default 1)
    save:bool=False, # If True, saves each page's items to disk
    save_file:str='{url}_page_{page}.json', # Filepath pattern for saving (only used if save=True)
    force:bool=False, # If True, forces re-fetching even if saved file exists
    kw:VAR_KEYWORD
)->list: # Extra kwargs passed to fetch() (verify, headers, etc.)
```

*Paginate through a JSON API, collecting all results. Auto-detects the items list in response.*

``` python
from fastcore.test import test_eq
```

``` python
test_eq(clean_md('before after'), 'before after')
# surrounding newlines are consumed along with the block
test_eq(clean_md('a\n<details>hidden</details>\nb'), 'ab')
test_eq(clean_md('a\n\n<details>hidden</details>\n\nb'), 'a\n\nb')
test_eq(clean_md('no change', rm_comments=False, rm_details=False), 'no change')

md = html2md('<h1>Hello</h1><p>World</p>')
assert '# Hello' in md and 'World' in md

# to_md tests
html_ = '<h1>Hello</h1><p>World</p><p>Keep</p>'
_md = to_md(html_)
assert 'Hello' in _md
assert 'World' in _md

# accepts Page dict: extracts html field
_page = {'url': 'https://example.com', 'status': 200, 'html': html_, 'data': None, 'xhr': []}
assert to_md(_page) == _md

# sel extracts first matching element only
_md_h1 = to_md(html_, sel='h1')
assert 'Hello' in _md_h1
assert 'World' not in _md_h1

# multi=True returns all matches joined
_md_ps = to_md(html_, sel='p', multi=True)
assert 'World' in _md_ps
assert 'Keep' in _md_ps

# sel with no match returns empty string
_md_none = to_md(html_, sel='div.missing')
assert _md_none == '' or len(_md_none) < 5  # html2md of empty string may produce minimal whitespace

# wrap_tag wraps each multi-result
_md_wrapped = to_md(html_, sel='p', multi=True, wrap_tag='item')
assert '<item>' in _md_wrapped
assert '</item>' in _md_wrapped
```

``` python
_pg = fetch('https://httpbin.org/get', verify=False)
assert isinstance(_pg, dict), f"Expected dict, got {type(_pg)}"
assert set(_pg.keys()) == {'url', 'status', 'html', 'data', 'xhr'}, f"Keys mismatch: {_pg.keys()}"
assert _pg['status'] == 200, f"Expected 200, got {_pg['status']}"
assert _pg['xhr'] == [], "xhr should be empty without capture_xhr"
assert len(_pg['html']) > 0, "html should be non-empty"

# httpbin.org/get returns JSON, so data should be a parsed dict
assert _pg['data'] is not None, "data should be parsed JSON for a JSON response"
assert _pg['data']['url'] == 'https://httpbin.org/get'

# to_md integration: fetch + convert
_text = to_md(_pg)
assert isinstance(_text, str) and len(_text) > 0
```

    [2026-06-03 08:26:53] INFO: Fetched (200) (referer: https://www.google.com/)

``` python
_pages = crawl('https://httpbin.org', max_pages=2, verify=False)
assert isinstance(_pages, list), f"Expected list, got {type(_pages)}"
assert len(_pages) > 0, "Expected at least one page"
assert all(isinstance(p, dict) for p in _pages)
assert all(set(p.keys()) == {'url', 'status', 'html', 'data', 'xhr'} for p in _pages), \
    f"Unexpected keys: {[set(p.keys()) for p in _pages]}"
assert all(p['status'] == 200 for p in _pages), "Non-200 pages should be skipped"
assert all(len(p['html']) > 0 for p in _pages), "html should be non-empty"
assert len({p['url'] for p in _pages}) == len(_pages), "url values should be unique"
```

    [2026-06-03 08:27:03] INFO: Fetched (200) (referer: https://www.google.com/)
    [2026-06-03 08:27:03] INFO: Fetched (200) (referer: https://www.google.com/)

``` python
# Step 1: visit the listing page with a headless browser, capture all XHR/fetch calls
apis = find_xhr('https://www.danmurphys.com.au/list/wine-all', verify=False)
```

    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.google.com/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (404) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/list/wine-all)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: None)
    [2026-06-03 08:27:22] ERROR: Error getting page content: Response.body: Response body is unavailable for redirect responses
    [2026-06-03 08:27:22] INFO: Fetched (302) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/list/wine-all)
    [2026-06-03 08:27:22]
    INFO: Fetched (200) (referer: https://murphystorage.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://murphystorage.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://murphystorage.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://murphystorage.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://murphystorage.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://widgets.merchants.danmurphys.com.au/)
    [2026-06-03 08:27:22] ERROR: Error getting page content: Response.body: Protocol error (Network.getResponseBody): No data found for resource with given identifier
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] ERROR: Error getting page content: Response.body: Protocol error (Network.getResponseBody): No data found for resource with given identifier
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] ERROR: Error getting page content: Response.body: Protocol error (Network.getResponseBody): No data found for resource with given identifier
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] ERROR: Error getting page content: Response.body: Protocol error (Network.getResponseBody): No data found for resource with given identifier
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] ERROR: Error getting page content: Response.body: Protocol error (Network.getResponseBody): No data found for resource with given identifier
    [2026-06-03 08:27:22] INFO: Fetched (400) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (202) (referer: https://widgets.merchants.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)
    [2026-06-03 08:27:22] INFO: Fetched (200) (referer: https://www.danmurphys.com.au/)

``` python
# Dan Murphy's wines, full workflow: discover API → paginate → save JSON
#
# Dan Murphy's is a SPA: the product listing page loads products via a hidden JSON API.
# Step 1 visits with a real browser to intercept those calls; step 2 replays the API
# directly (no browser needed) to collect wines with full pricing data.
import json
from pathlib import Path

# find the product API: large JSON response containing 'Items' list
wine_api = next(
    a for a in apis
    if 'api.danmurphys.com.au/apis/ui/ProductGroup/Products/wine%20all' in a['url']
    and isinstance(a.get('data'), dict)
    and 'Items' in a['data']
)
print(f"API endpoint : {wine_api['url']}")
print(f"Response keys: {list(wine_api['data'].keys())}")
print(f"Total available: {wine_api['data'].get('TotalRecordCount', '?')} products")
_items = wine_api['data']['Items']
_sample = next(iter(_items.values() if isinstance(_items, dict) else _items))
print(f"Fields per item: {list(_sample.keys())}")

# Step 2: paginate the API directly (one page of 5 items here), no browser required
wines = paginate_api(
    wine_api['url'],
    payload={
        'pageSize': 5, 'pageNumber': 1,
        'sortType': 'Relevance', 'Location': 'ProductGroup',
        'Filters': [], 'ShowOnlyAvailable': False,
    },
    page_field='pageNumber',
    size_field='pageSize',
    results_field='Items',
    max_pages=1,
    save=True,
    save_file='test_page_{page}.json',
    verify=False,
)
print(f"\nCollected {len(wines)} wines")

# Step 3: save full product data (includes Price, PromoPrice, Name, Brand, Rating, etc.)
Path('wines.json').write_text(json.dumps(wines, indent=2))
print(f"Saved to wines.json ({Path('wines.json').stat().st_size // 1024} KB)")

# preview first result
wines[0]
```

    API endpoint : https://api.danmurphys.com.au/apis/ui/ProductGroup/Products/wine%20all
    Response keys: ['Aggregations', 'Banners', 'Cards', 'DisplayName', 'SearchSource', 'Items', 'TotalRecordCount']
    Total available: 7837 products
    Fields per item: ['Name', 'PackDefaultStockCode', 'PackParentStockCode', 'Products', 'PackMessage', 'IsInDefaultList', 'IsPersonalised']
    Page 1 already saved, skipping fetch

    Collected 7 wines
    Saved to wines.json (0 KB)

    'Aggregations'

``` python
wines = paginate_api(
    wine_api['url'],
    page_field='pageNumber', size_field='pageSize', results_field='Items',
    payload={'pageSize': 48, 'pageNumber': 1, 'sortType': 'Relevance', 'Location': 'ProductGroup',
             'Filters': [], 'ShowOnlyAvailable': False},
    max_pages=100, save=True,
    save_file='downloads/danmurphys_wines_page_{page}.json',
    verify=False,
)
Path('wines.json').write_text(json.dumps(wines, indent=2))
print(f"Saved to wines.json ({Path('wines.json').stat().st_size // 1024} KB)")
```

    Page 1 already saved, skipping fetch
    Page 2 already saved, skipping fetch
    Page 3 already saved, skipping fetch
    Page 4 already saved, skipping fetch
    Page 5 already saved, skipping fetch
    Page 6 already saved, skipping fetch
    Page 7 already saved, skipping fetch
    Page 8 already saved, skipping fetch
    Page 9 already saved, skipping fetch
    Page 10 already saved, skipping fetch
    Page 11 already saved, skipping fetch
    Page 12 already saved, skipping fetch
    Page 13 already saved, skipping fetch
    Page 14 already saved, skipping fetch
    Page 15 already saved, skipping fetch
    Page 16 already saved, skipping fetch
    Page 17 already saved, skipping fetch
    Page 18 already saved, skipping fetch
    Page 19 already saved, skipping fetch
    Page 20 already saved, skipping fetch
    Page 21 already saved, skipping fetch
    Page 22 already saved, skipping fetch
    Page 23 already saved, skipping fetch
    Page 24 already saved, skipping fetch
    Page 25 already saved, skipping fetch
    Page 26 already saved, skipping fetch
    Page 27 already saved, skipping fetch
    Page 28 already saved, skipping fetch
    Page 29 already saved, skipping fetch
    Page 30 already saved, skipping fetch
    Page 31 already saved, skipping fetch
    Page 32 already saved, skipping fetch
    Page 33 already saved, skipping fetch
    Page 34 already saved, skipping fetch
    Page 35 already saved, skipping fetch
    Page 36 already saved, skipping fetch
    Page 37 already saved, skipping fetch
    Page 38 already saved, skipping fetch
    Page 39 already saved, skipping fetch
    Page 40 already saved, skipping fetch
    Page 41 already saved, skipping fetch
    Page 42 already saved, skipping fetch
    Page 43 already saved, skipping fetch
    Page 44 already saved, skipping fetch
    Page 45 already saved, skipping fetch
    Page 46 already saved, skipping fetch
    Page 47 already saved, skipping fetch
    Page 48 already saved, skipping fetch
    Page 49 already saved, skipping fetch
    Page 50 already saved, skipping fetch
    Page 51 already saved, skipping fetch
    Page 52 already saved, skipping fetch
    Page 53 already saved, skipping fetch
    Page 54 already saved, skipping fetch
    Page 55 already saved, skipping fetch
    Page 56 already saved, skipping fetch
    Page 57 already saved, skipping fetch
    Page 58 already saved, skipping fetch
    Page 59 already saved, skipping fetch
    Page 60 already saved, skipping fetch
    Page 61 already saved, skipping fetch
    Page 62 already saved, skipping fetch
    Page 63 already saved, skipping fetch
    Page 64 already saved, skipping fetch
    Page 65 already saved, skipping fetch
    Page 66 already saved, skipping fetch
    Page 67 already saved, skipping fetch
    Page 68 already saved, skipping fetch
    Page 69 already saved, skipping fetch
    Page 70 already saved, skipping fetch
    Page 71 already saved, skipping fetch
    Page 72 already saved, skipping fetch
    Page 73 already saved, skipping fetch
    Page 74 already saved, skipping fetch
    Page 75 already saved, skipping fetch
    Page 76 already saved, skipping fetch
    Page 77 already saved, skipping fetch
    Page 78 already saved, skipping fetch
    Page 79 already saved, skipping fetch
    Page 80 already saved, skipping fetch
    Page 81 already saved, skipping fetch
    Page 82 already saved, skipping fetch
    Page 83 already saved, skipping fetch
    Page 84 already saved, skipping fetch
    Page 85 already saved, skipping fetch
    Page 86 already saved, skipping fetch
    Page 87 already saved, skipping fetch
    Page 88 already saved, skipping fetch
    Page 89 already saved, skipping fetch
    Page 90 already saved, skipping fetch
    Page 91 already saved, skipping fetch
    Page 92 already saved, skipping fetch
    Page 93 already saved, skipping fetch
    Page 94 already saved, skipping fetch
    Page 95 already saved, skipping fetch
    Page 96 already saved, skipping fetch
    Page 97 already saved, skipping fetch
    Page 98 already saved, skipping fetch
    Page 99 already saved, skipping fetch
    Page 100 already saved, skipping fetch
    Saved to wines.json (10 KB)

``` python
L(apis).filter(lambda a: 'api.danmurphys.com.au/apis/ui/ProductGroup/Products/wine%20all' in a['url'])[0]['url']
```

    'https://api.danmurphys.com.au/apis/ui/ProductGroup/Products/wine%20all'

``` python
# paginate_api: test with JSONPlaceholder (free public REST API, GET-based)
posts = paginate_api(
    'https://jsonplaceholder.typicode.com/posts',
    payload={'_page': 1, '_limit': 5},
    page_field='_page',
    size_field='_limit',
    method='GET',
    verify=False,
)
assert len(posts) >= 5
assert 'title' in posts[0]
```

    [2026-06-03 08:27:32] INFO: Fetched (200) (referer: https://www.google.com/)
    [2026-06-03 08:27:33] INFO: Fetched (200) (referer: https://www.google.com/)
    [2026-06-03 08:27:33] INFO: Fetched (200) (referer: https://www.google.com/)
    [2026-06-03 08:27:34] INFO: Fetched (200) (referer: https://www.google.com/)
    [2026-06-03 08:27:35] INFO: Fetched (200) (referer: https://www.google.com/)
    [2026-06-03 08:27:36] INFO: Fetched (200) (referer: https://www.google.com/)
    [2026-06-03 08:27:36] INFO: Fetched (200) (referer:
    https://www.google.com/)
    [2026-06-03 08:27:37] INFO: Fetched (200) (referer: https://www.google.com/)
    [2026-06-03 08:27:38] INFO: Fetched (200) (referer: https://www.google.com/)
    [2026-06-03 08:27:38] INFO: Fetched (200) (referer: https://www.google.com/)

## YouTube

[`search_yt()`](https://vedicreader.github.io/fossick/cli.html#search_yt) runs a YouTube search and returns metadata for each video. [`read_yt()`](https://vedicreader.github.io/fossick/cli.html#read_yt) fetches the auto-generated English captions as plain text, disk-cached by video ID. [`download_yt()`](https://vedicreader.github.io/fossick/core.html#download_yt) saves audio or video to disk.

------------------------------------------------------------------------

### search_yt

``` python
def search_yt(
    q:str, n:int=10
)->L:
```

*Search YouTube; returns L of dicts: id, title, url, duration, view_count, channel, description, thumbnail*

``` python
results = search_yt('3blue1brown neural networks', n=3)
assert isinstance(results, L), f"expected L, got {type(results)}"
assert len(results) >= 1, f"expected results, got {len(results)}"
assert any(kw in results[0]['title'].lower() for kw in ('3blue1brown', 'neural', 'network')), \
    f"unexpected title: {results[0]['title']}"
assert results[0]['url'].startswith('https://www.youtube.com'), f"bad url: {results[0]['url']}"
print(results[0]['title'], '|', results[0]['url'])
```

    But what is a neural network? | Deep learning chapter 1 | https://www.youtube.com/watch?v=aircAruvnKk

------------------------------------------------------------------------

### read_yt

``` python
def read_yt(
    url:str, force:bool=False
)->dict:
```

*Fetch YouTube metadata + English transcript (auto-captions); result disk-cached by video ID*

``` python
meta = read_yt('https://www.youtube.com/watch?v=aircAruvnKk')
assert meta['title'], "title should be non-empty"
assert isinstance(meta['source'], str), "source should be a string"
assert len(meta['source']) > 100, f"transcript too short: {len(meta['source'])} chars"
assert '3blue1brown' in meta['channel'].lower(), f"unexpected channel: {meta['channel']}"
print(f"title: {meta['title']}")
print(f"transcript preview: {meta['source'][:200]}")
```

    title: But what is a neural network? | Deep learning chapter 1
    transcript preview: [Music] This is a three. It's sloppily written and rendered at an extremely low resolution of 28x 28 pixels. But your brain has no trouble recognizing it as a three. And I want you to take a moment to

------------------------------------------------------------------------

### download_yt

``` python
def download_yt(
    url:str, format:str='audio', save_dir:str='.', quality:str=None
)->Path:
```

*Download YouTube media; format=‘audio’|‘video’|yt-dlp format string. Returns Path to saved file.*

``` python
p = download_yt('https://www.youtube.com/watch?v=aircAruvnKk', format='audio', save_dir='/tmp/fossick_test')
assert p.exists(), f"file not found: {p}"
assert p.suffix == '.mp3', f"expected .mp3, got {p.suffix}"
print(f"saved to: {p} ({p.stat().st_size // 1024} KB)")
```

    WARNING: [youtube] [jsc] Remote components challenge solver script (deno) and NPM package (deno) were skipped. These may be required to solve JS challenges. You can enable these downloads with --remote-components ejs:github (recommended) or --remote-components ejs:npm , respectively. For more information and alternatives, refer to https://github.com/yt-dlp/yt-dlp/wiki/EJS
    WARNING: [youtube] aircAruvnKk: n challenge solving failed: Some formats may be missing. Ensure you have a supported JavaScript runtime and challenge solver script distribution installed. Review any warnings presented before this message. For more details, refer to https://github.com/yt-dlp/yt-dlp/wiki/EJS

    saved to: /tmp/fossick_test/But what is a neural network？｜ Deep learning chapter 1.mp3 (26250 KB)

## Install

------------------------------------------------------------------------

### mv_skill_md

``` python
def mv_skill_md(
    dry_run:bool=True, dir:NoneType=None
)->None:
```

*Copy bundled SKILL.md to skill directories.*

------------------------------------------------------------------------

### repo_root

``` python
def repo_root(
)->Path:
```

*Find the root of the current git repository, or None if not in a repo.*

``` python
root = repo_root()
assert root is not None and (root/'.git').exists(), f"Expected git root, got {root}"
mv_skill_md(dry_run=True)
```
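As a rough illustration of the lookup that `repo_root()` describes, the sketch below walks up from a starting directory until it finds a `.git` entry. `find_repo_root` and its `start` parameter are hypothetical names, and checking for `.git` in each ancestor is an assumption about how such a search could work, not a description of fossick's code.

``` python
from pathlib import Path
import tempfile

def find_repo_root(start='.'):
    "Hypothetical sketch: return the nearest ancestor of `start` containing `.git`, else None."
    p = Path(start).resolve()
    # Check the starting directory first, then each parent up to the filesystem root
    for d in (p, *p.parents):
        if (d / '.git').exists():
            return d
    return None

# Usage: build a throwaway directory tree that looks like a repo
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp).resolve() / 'proj'
    nested = repo / 'src' / 'pkg'
    nested.mkdir(parents=True)
    (repo / '.git').mkdir()
    found = find_repo_root(nested)
    print(found == repo)  # True
```

Returning `None` rather than raising matches the docstring's "or None if not in a repo", which is why the notebook cell above asserts `root is not None` before using it.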