Does anyone have a local copy of zompist.com?

User avatar
WarpedWartWars
Posts: 197
Joined: Sat Aug 28, 2021 2:31 pm
Location: tɑ tɑ θiθɾ eɾloθ tɑ moew θerts

Does anyone have a local copy of zompist.com?

Post by WarpedWartWars »

Just curious because I'm currently making one by copying the source (copy as in Ctrl+c Ctrl+v) into HTML files, as well as downloading the required images.
tɑ tɑ tɑ tɑ θiθɾ eɾloθ tɑ moew θerts olɑrk siθe
of of of of death abyss of moew kingdom sand witch-PLURAL
The witches of the desert of the kingdom of Moew of the Abyss of Death

tɑ toɾose koɾot tsɑx
of apple-PLURAL magic cold
cold magic of apples
mocha
Posts: 55
Joined: Sun Mar 20, 2022 8:23 am
Location: Eremor, Pankair, Oneia
Contact:

Re: Does anyone have a local copy of zompist.com?

Post by mocha »

WarpedWartWars wrote: Tue Mar 22, 2022 7:41 pm I'm currently making one by copying the source (copy as in Ctrl+c Ctrl+v) into HTML files, as well as downloading the required images.
On Firefox, for single pages at least, it would probably be simpler to just do CTRL+S and select "Web page, complete". This should automatically download all the files used by a particular page.

If you want to download an entire site (i.e., hundreds of pages), it would probably be faster to just use a web scraper.
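
If you'd rather script it, a single page only takes a couple of lines of Python with the standard library (the output filename here is just an example):

Code: Select all

# Minimal sketch: fetch one page with the standard library.
# This grabs only the HTML itself, not the images/CSS it references.
from urllib.request import urlopen

with urlopen("https://zompist.com/") as response:
    html = response.read()            # raw bytes, so no decoding worries yet

with open("zompist_index.html", "wb") as f:
    f.write(html)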
User avatar
WarpedWartWars
Posts: 197
Joined: Sat Aug 28, 2021 2:31 pm
Location: tɑ tɑ θiθɾ eɾloθ tɑ moew θerts

Re: Does anyone have a local copy of zompist.com?

Post by WarpedWartWars »

mocha wrote: Tue Mar 29, 2022 11:38 pm If you want to download an entire site (i.e., hundreds of pages), it would probably be faster to just use a web scraper.
Not actually sure what that is.
tɑ tɑ tɑ tɑ θiθɾ eɾloθ tɑ moew θerts olɑrk siθe
of of of of death abyss of moew kingdom sand witch-PLURAL
The witches of the desert of the kingdom of Moew of the Abyss of Death

tɑ toɾose koɾot tsɑx
of apple-PLURAL magic cold
cold magic of apples
User avatar
Raphael
Posts: 4568
Joined: Sun Jul 22, 2018 6:36 am

Re: Does anyone have a local copy of zompist.com?

Post by Raphael »

I used to have various copies of it, made by web scrapers at different points in the 2000s and 2010s, but I think they're all gone now.
mocha
Posts: 55
Joined: Sun Mar 20, 2022 8:23 am
Location: Eremor, Pankair, Oneia
Contact:

Re: Does anyone have a local copy of zompist.com?

Post by mocha »

WarpedWartWars wrote: Wed Mar 30, 2022 11:59 pm
mocha wrote: Tue Mar 29, 2022 11:38 pm If you want to download an entire site (i.e., hundreds of pages), it would probably be faster to just use a web scraper.
Not actually sure what that is.
It's a program that basically does this:

(1) Download webpage
(2) Look for links on webpage
(3) Goto (1) with new webpage(s)

Fairly simple to code and allows you to download a website fairly quickly (at least, all linked pages...)
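
A very stripped-down version of that loop, using only the standard library (it only follows absolute links, and the page limit is just a safety cap):

Code: Select all

# Rough sketch of the download -> find links -> repeat loop.
# Link extraction uses html.parser; real pages need more care
# (relative paths, src= attributes, etc.), and nothing here keeps
# the crawl on a single site -- hence the page limit.
from urllib.request import urlopen
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, limit=50):
    todo, done = [start_url], []
    while todo and len(done) < limit:
        url = todo.pop(0)
        if url in done:
            continue
        done.append(url)
        page = urlopen(url).read().decode(errors="replace")   # (1) download webpage
        collector = LinkCollector()
        collector.feed(page)                                   # (2) look for links
        todo.extend(link for link in collector.links
                    if link.startswith("http"))                # (3) goto (1) with new pages
    return done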
User avatar
alice
Posts: 962
Joined: Mon Jul 09, 2018 11:15 am
Location: 'twixt Survival and Guilt

Re: Does anyone have a local copy of zompist.com?

Post by alice »

mocha wrote: Fri Apr 01, 2022 11:14 am
WarpedWartWars wrote: Wed Mar 30, 2022 11:59 pm
mocha wrote: Tue Mar 29, 2022 11:38 pm If you want to download an entire site (i.e., hundreds of pages), it would probably be faster to just use a web scraper.
Not actually sure what that is.
It's a program that basically does this:

(1) Download webpage
(2) Look for links on webpage
(3) Goto (1) with new webpage(s)

Fairly simple to code and allows you to download a website fairly quickly (at least, all linked pages...)
But you have to be careful that you don't accidentally download the entire Internet in the process.
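
The usual safeguard is to only follow links whose host matches the site you started on; a minimal sketch of that check (the start URL is just an example):

Code: Select all

# Only follow links that stay on the starting site, so the crawl
# can't wander off into the rest of the Internet.
from urllib.parse import urlparse

def same_site(link, start_url="https://zompist.com/"):
    host = urlparse(link).netloc
    # Relative links have no host of their own, so they count as same-site.
    return host == "" or host == urlparse(start_url).netloc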
Self-referential signatures are for people too boring to come up with more interesting alternatives.
mocha
Posts: 55
Joined: Sun Mar 20, 2022 8:23 am
Location: Eremor, Pankair, Oneia
Contact:

Re: Does anyone have a local copy of zompist.com?

Post by mocha »

While writing a small web-scraping utility as a proof of concept, I found I'd apparently already made one and downloaded zompist.com about a year ago, except it's mostly just the pages without CSS. Oh well.

It's on my GitHub, if you're interested: https://github.com/Mocha2007/mochalib/b ... _domain.py
User avatar
WarpedWartWars
Posts: 197
Joined: Sat Aug 28, 2021 2:31 pm
Location: tɑ tɑ θiθɾ eɾloθ tɑ moew θerts

Re: Does anyone have a local copy of zompist.com?

Post by WarpedWartWars »

mocha wrote: Fri Apr 01, 2022 11:14 am It's a program that basically does this:

(1) Download webpage
(2) Look for links on webpage
(3) Goto (1) with new webpage(s)

Fairly simple to code and allows you to download a website fairly quickly (at least, all linked pages...)
I'd like to code it in Python, but the thing I'm having trouble with is step 2.

Edit: Got step 2 working; at least for

Code: Select all

<a href="relative/path/with/no/explicit/domain.txt"></a>
, which is what zompist.com uses for zompist.com/... links.

Edit: This is what I have so far:

Code: Select all

import os
from urllib.request import urlopen
import re

def _load(url):
    return urlopen(url).read() #.decode(errors="backslashreplace")

def load(url):
    try:
        return (file := _load(url)).decode()
    except UnicodeDecodeError:
        return file

def get_links(url, page):
    narrowed = []
    for (_, _, link) in re.findall(r"""(href|src)=(?P<quote>['"])(?P<url>.*?)(?P=quote)""",
                                   page, re.IGNORECASE):
        curr = ["https://" + domain(url)]
        if link.startswith("http"):
            if domain(link) == domain(url):
                if "/" in (rest := nondomain(link)):
                    curr.append(rest.split("/"))
                else:
                    curr.append(rest)
        elif "/" in link:
            curr.append(link.split("/"))
        else:
            curr.append(link)
        narrowed.append("/".join(curr))
    return narrowed

def save_file(path, file):
    print("saving '" + path + "'...")
    if "/" in path:
        os.makedirs("/".join(path.split("/")[:-1]))
    with open(path, "w" + isinstance(file, bytes) * "b") as f:
        f.write(file)

def dhelp(url):
    return (url.lstrip("qwertyuiopasdfghjklzxcvbnm").lstrip(":/")
                if "://" in url else url).split("/")

def domain(url):
    return dhelp(url)[0]

def nondomain(url):
    return ("/".join(dhelp(url)[1:]) if len(dhelp(url)) else "")

def _main(url):
    global done
    if url in done:
        return
    done.append(url)
    page = load(url)
    save_file(nondomain(url), page)
    for link in get_links(url, page):
        _main(link)

def main(url):
    global done
    done = []
    os.mkdir(domain(url))
    os.chdir(domain(url))
    _main((url + "index.html") if not nondomain(url) else url)
but:

Code: Select all

>>> main("https://zompist.com/")
saving 'index.html'...
Traceback (most recent call last):
  File "<pyshell#32>", line 1, in <module>
    main("https://zompist.com/")
  File "C:\Users\<user>\Desktop\py\webscrape.py", line 65, in main
    _main((url + "index.html") if not nondomain(url) else url)
  File "C:\Users\<user>\Desktop\py\webscrape.py", line 56, in _main
    save_file(nondomain(url), page)
  File "C:\Users\<user>\Desktop\py\webscrape.py", line 37, in save_file
    f.write(file)
  File "C:\Users\<user>\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0263' in position 4166: character maps to <undefined>
tɑ tɑ tɑ tɑ θiθɾ eɾloθ tɑ moew θerts olɑrk siθe
of of of of death abyss of moew kingdom sand witch-PLURAL
The witches of the desert of the kingdom of Moew of the Abyss of Death

tɑ toɾose koɾot tsɑx
of apple-PLURAL magic cold
cold magic of apples
User avatar
WarpedWartWars
Posts: 197
Joined: Sat Aug 28, 2021 2:31 pm
Location: tɑ tɑ θiθɾ eɾloθ tɑ moew θerts

Re: Does anyone have a local copy of zompist.com?

Post by WarpedWartWars »

[offtopic]
BTW who made this Wikipedia article?
[/offtopic]
tɑ tɑ tɑ tɑ θiθɾ eɾloθ tɑ moew θerts olɑrk siθe
of of of of death abyss of moew kingdom sand witch-PLURAL
The witches of the desert of the kingdom of Moew of the Abyss of Death

tɑ toɾose koɾot tsɑx
of apple-PLURAL magic cold
cold magic of apples
mocha
Posts: 55
Joined: Sun Mar 20, 2022 8:23 am
Location: Eremor, Pankair, Oneia
Contact:

Re: Does anyone have a local copy of zompist.com?

Post by mocha »

WarpedWartWars wrote: Wed Apr 06, 2022 9:17 pm but:

Code: Select all

UnicodeEncodeError: 'charmap' codec can't encode character '\u0263' in position 4166: character maps to <undefined>
Because I'm lazy, my solution to this was to simply ignore the errors:

Code: Select all

file.write(src, encode='utf-8', errors='ignore')
I might be losing random characters here and there, but if I am, I don't see them, which is good enough for me! 8-)
User avatar
WarpedWartWars
Posts: 197
Joined: Sat Aug 28, 2021 2:31 pm
Location: tɑ tɑ θiθɾ eɾloθ tɑ moew θerts

Re: Does anyone have a local copy of zompist.com?

Post by WarpedWartWars »

mocha wrote: Thu Apr 07, 2022 6:47 pm Because I'm lazy, my solution to this was to simply ignore the errors:

Code: Select all

file.write(src, encode='utf-8', errors='ignore')
I might be losing random characters here and there, but if I am, I don't see them, which is good enough for me! 8-)
Now I'm getting

Code: Select all

Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    main("https://www.zompist.com/")
  File "C:\Users\<user>\Desktop\py\webscrape.py", line 64, in main
    _main((url + "index.html") if not nondomain(url) else url)
  File "C:\Users\<user>\Desktop\py\webscrape.py", line 55, in _main
    save_file(nondomain(url), page)
  File "C:\Users\<user>\Desktop\py\webscrape.py", line 37, in save_file
    f.write(file, errors="ignore")
TypeError: TextIOWrapper.write() takes no keyword arguments
tɑ tɑ tɑ tɑ θiθɾ eɾloθ tɑ moew θerts olɑrk siθe
of of of of death abyss of moew kingdom sand witch-PLURAL
The witches of the desert of the kingdom of Moew of the Abyss of Death

tɑ toɾose koɾot tsɑx
of apple-PLURAL magic cold
cold magic of apples
User avatar
alice
Posts: 962
Joined: Mon Jul 09, 2018 11:15 am
Location: 'twixt Survival and Guilt

Re: Does anyone have a local copy of zompist.com?

Post by alice »

Looks very much like a Python version issue.
Self-referential signatures are for people too boring to come up with more interesting alternatives.
User avatar
WarpedWartWars
Posts: 197
Joined: Sat Aug 28, 2021 2:31 pm
Location: tɑ tɑ θiθɾ eɾloθ tɑ moew θerts

Re: Does anyone have a local copy of zompist.com?

Post by WarpedWartWars »

alice wrote: Fri Apr 08, 2022 3:13 am Looks very much like a Python version issue.
I'm using 3.10.2. What might be causing it?
tɑ tɑ tɑ tɑ θiθɾ eɾloθ tɑ moew θerts olɑrk siθe
of of of of death abyss of moew kingdom sand witch-PLURAL
The witches of the desert of the kingdom of Moew of the Abyss of Death

tɑ toɾose koɾot tsɑx
of apple-PLURAL magic cold
cold magic of apples
User avatar
alice
Posts: 962
Joined: Mon Jul 09, 2018 11:15 am
Location: 'twixt Survival and Guilt

Re: Does anyone have a local copy of zompist.com?

Post by alice »

WarpedWartWars wrote: Fri Apr 08, 2022 3:41 am
alice wrote: Fri Apr 08, 2022 3:13 am Looks very much like a Python version issue.
I'm using 3.10.2. What might be causing it?
This error message:

Code: Select all

TypeError: TextIOWrapper.write() takes no keyword arguments
suggests that mocha's version of TextIOWrapper is not the same as yours. That's pretty much all I can say.
Self-referential signatures are for people too boring to come up with more interesting alternatives.
User avatar
WarpedWartWars
Posts: 197
Joined: Sat Aug 28, 2021 2:31 pm
Location: tɑ tɑ θiθɾ eɾloθ tɑ moew θerts

Re: Does anyone have a local copy of zompist.com?

Post by WarpedWartWars »

alice wrote: Fri Apr 08, 2022 1:22 pm
WarpedWartWars wrote: Fri Apr 08, 2022 3:41 am
alice wrote: Fri Apr 08, 2022 3:13 am Looks very much like a Python version issue.
I'm using 3.10.2. What might be causing it?
This error message:

Code: Select all

TypeError: TextIOWrapper.write() takes no keyword arguments
suggests that mocha's version of TextIOWrapper is not the same as yours. That's pretty much all I can say.
Maybe I could put the "errors" kwarg in "open(...)"...
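
open() does take encoding and errors keywords even though write() doesn't, so something along these lines should sidestep both the cp1252 UnicodeEncodeError from earlier and the TypeError (the filename and content are just placeholders):

Code: Select all

# The encoding/errors keywords belong on open(), not on write().
# Writing UTF-8 explicitly avoids the Windows default (cp1252), which
# is what choked on '\u0263' before.
page = "\u0263 and other characters cp1252 can't represent"   # placeholder content

with open("example.html", "w", encoding="utf-8", errors="ignore") as f:
    f.write(page)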
tɑ tɑ tɑ tɑ θiθɾ eɾloθ tɑ moew θerts olɑrk siθe
of of of of death abyss of moew kingdom sand witch-PLURAL
The witches of the desert of the kingdom of Moew of the Abyss of Death

tɑ toɾose koɾot tsɑx
of apple-PLURAL magic cold
cold magic of apples
User avatar
WarpedWartWars
Posts: 197
Joined: Sat Aug 28, 2021 2:31 pm
Location: tɑ tɑ θiθɾ eɾloθ tɑ moew θerts

Re: Does anyone have a local copy of zompist.com?

Post by WarpedWartWars »

Ok. I have this:

Code: Select all

import os
from urllib.request import urlopen
import re

def _load(url):
    return urlopen(url).read() #.decode(errors="backslashreplace")

def load(url):
    try:
        return (file := _load(url)).decode()
    except UnicodeDecodeError:
        return file

def get_links(url, page):
    narrowed = []
    for (_, _, link) in re.findall(r"""(href|src)=(?P<quote>['"])(?P<url>.*?)(?P=quote)""",
                                   page, re.IGNORECASE):
        if not nondomain(link):
            continue
        curr = ["https://" + domain(url)]
        if link.startswith("http"):
            if domain(link) == domain(url):
                if "/" in (rest := nondomain(link)):
                    curr += rest.split("/")
                else:
                    curr += [rest]
        elif "/" in link:
            curr += link.split("/")
        else:
            curr += [link]
        narrowed.append("/".join(curr))
    return narrowed

def save_file(path, file):
    print("saving '" + path + "'...")
    if "/" in path:
        os.makedirs("/".join(path.split("/")[:-1]))
    with (open(path, "wb")
          if isinstance(file, bytes)
          else open(path, "w", errors="ignore")) as f:
        f.write(file)

def dhelp(url):
    return (url.lstrip("qwertyuiopasdfghjklzxcvbnm").lstrip(":/")
                if "://" in url else url).split("/")

def domain(url):
    return dhelp(url)[0]

def nondomain(url):
    return ("/".join(dhelp(url)[1:]) if len(dhelp(url)) else "")

def _main(url):
    global done
    if url in done:
        return
    done.append(url)
    print(url)
    page = load(url)
    save_file(nondomain(url), page)
    if not isinstance(page, bytes):
        for link in get_links(url, page):
            if nondomain(link):
                _main(link)

def main(url):
    global done
    done = []
    os.mkdir(domain(url))
    os.chdir(domain(url))
    _main((url + "index.html") if not nondomain(url) else url)
and it works for some things, but not "../stuff":

Code: Select all

>>> main("https://zompist.com/")
https://zompist.com/index.html
saving 'index.html'...
https://zompist.com/illo/zbblogo.gif
saving 'illo/zbblogo.gif'...
https://zompist.com/mars/index.html
saving 'mars/index.html'...
https://zompist.com/../incatena.html
Traceback (most recent call last):
  File "<pyshell#18>", line 1, in <module>
    main("https://zompist.com/")
  File "C:\Users\<user>\Desktop\py\webscrape.py", line 71, in main
    _main((url + "index.html") if not nondomain(url) else url)
  File "C:\Users\<user>\Desktop\py\webscrape.py", line 64, in _main
    _main(link)
  File "C:\Users\<user>\Desktop\py\webscrape.py", line 64, in _main
    _main(link)
  File "C:\Users\<user>\Desktop\py\webscrape.py", line 59, in _main
    page = load(url)
  File "C:\Users\<user>\Desktop\py\webscrape.py", line 10, in load
    return (file := _load(url)).decode()
  File "C:\Users\<user>\Desktop\py\webscrape.py", line 6, in _load
    return urlopen(url).read() #.decode(errors="backslashreplace")
  File "C:\Users\<user>\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\<user>\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 525, in open
    response = meth(req, response)
  File "C:\Users\<user>\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 634, in http_response
    response = self.parent.error(
  File "C:\Users\<user>\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 563, in error
    return self._call_chain(*args)
  File "C:\Users\<user>\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 496, in _call_chain
    result = func(*args)
  File "C:\Users\<user>\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request
How do I fix that?
tɑ tɑ tɑ tɑ θiθɾ eɾloθ tɑ moew θerts olɑrk siθe
of of of of death abyss of moew kingdom sand witch-PLURAL
The witches of the desert of the kingdom of Moew of the Abyss of Death

tɑ toɾose koɾot tsɑx
of apple-PLURAL magic cold
cold magic of apples
User avatar
WarpedWartWars
Posts: 197
Joined: Sat Aug 28, 2021 2:31 pm
Location: tɑ tɑ θiθɾ eɾloθ tɑ moew θerts

Re: Does anyone have a local copy of zompist.com?

Post by WarpedWartWars »

Update: I've reworked the code and it at least mostly works now:

Code: Select all

import os
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from http.client import InvalidURL
import sys
import re

def istextlike(url):
    return any(url.lower().endswith(ext) for ext in
               (".txt",".md",".html",".htm",".shtml"))

def _load(url):
    return urlopen(url).read() #.decode(errors="backslashreplace")

def load(url):
    c = _load(url)
    if istextlike(url):
        c = re.sub(r"\\x([0-9a-f]{2})",lambda m:chr(int(m[1],16)),
                   c.decode(errors='backslashreplace'))
    return c

def get_links(url, page):
    narrowed = []
    for match in re.findall(r"""(?:
                                 (?:href|src)=
                                 (?P<quote1>['"])
                                  (?P<url1>.*?)
                                 (?P=quote1)
                                )|
                                (?:
                                 (?:url|open)\(
                                  (?P<quote2>['"])
                                   (?P<url2>.*?)
                                  (?P=quote2)
                                 \)
                                )|
                                (?:
                                 name\s*=\s*
                                 (?P<quote3>['"])
                                  (?P<url3>.*?)
                                 (?P=quote3)
                                )""",
                                   page, re.IGNORECASE | re.VERBOSE):
        link = match[1] or match[3] or match[5]
        if ("#" in link or
            "?" in link):
            continue
        relurl = nondomain(url).split("/")[:-1]
        curr = ["http://"+domain(url)]
        while link.startswith("."):
            relurl = relurl[:-1]
            link = "/".join(link.split("/")[1:])
        curr += relurl
        if link.startswith("http"):
            if domain(link) == domain(url):
                curr = [curr[0]] + nondomain(link).split("/")
            else:
                continue
        elif ("://" in link or
              link.startswith("mailto:")):
            continue
        elif "/" in link:
            if link.startswith("/"):
                curr = [curr[0]]
                link = link[1:]
            curr += link.split("/")
        else:
            curr += [link]
        narrowed.append("/".join(curr))
    return narrowed

def save_file(path, file):
    path = path.strip()
    if path[-1] == "/":
        path += "index.html"
    print("saving '" + path + "'...")
    if "/" in path:
        os.makedirs("/".join(path.split("/")[:-1]), exist_ok=True)
    try:
        with (open(path, "wb")) as f:
              #if isinstance(file, bytes)
              #else open(path, "w", errors="ignore")) as f:
            f.write(file if isinstance(file, bytes)
                    else file.encode())
    except PermissionError:
        pass

def dhelp(url):
    return (url.lstrip("qwertyuiopasdfghjklzxcvbnm").lstrip(":/")
                if "://" in url else #"./"+
                                     url).split("/")

def domain(url):
    return dhelp(url)[0]

def nondomain(url):
    return ("/".join(dhelp(url)[1:]) if len(dhelp(url)) else "")

def _main(todo, done):
    url = todo.pop(0)
    if url in done:
        return
    done.append(url)
    print(url)
    try:
        page = load(url)
    except HTTPError as err:
        if err.status == 404:
            print("404: " + url, file=sys.stderr)
            return
        raise
    except URLError as err:
        if True:#err.errno == -3:
            print("Failed, no internet probably")
        return
    except InvalidURL as err:
        print(err, file=sys.stderr)
        return
    save_file(nondomain(url), page)
    if isinstance(page, bytes):
        return
    for link in get_links(url, page):
        if nondomain(link):
            todo.append(link)#_main(link)

def main(url):
    done = []
    todo = [url]
    os.makedirs(domain(url).replace(":", "_"), exist_ok=True)
    os.chdir(domain(url).replace(":", "_"))
    try:
        while len(todo)>0:
            _main(todo, done) #(url + "/index.html") if not nondomain(url) else url)
    finally:
        os.chdir("../")
It is by no means perfect, but it's working well enough for my purposes, for now.
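
For the relative-path handling specifically, urllib.parse.urljoin resolves links like "../incatena.html" against the page they were found on, which could replace most of the manual splitting (the extra links below are just examples):

Code: Select all

# urljoin resolves relative links, including "../", against the page URL,
# so "https://zompist.com/../incatena.html" never gets built by hand.
from urllib.parse import urljoin

base = "https://zompist.com/mars/index.html"   # page the link was found on
print(urljoin(base, "../incatena.html"))        # https://zompist.com/incatena.html
print(urljoin(base, "peoples.html"))            # https://zompist.com/mars/peoples.html
print(urljoin(base, "/index.html"))             # https://zompist.com/index.html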
tɑ tɑ tɑ tɑ θiθɾ eɾloθ tɑ moew θerts olɑrk siθe
of of of of death abyss of moew kingdom sand witch-PLURAL
The witches of the desert of the kingdom of Moew of the Abyss of Death

tɑ toɾose koɾot tsɑx
of apple-PLURAL magic cold
cold magic of apples
User avatar
WarpedWartWars
Posts: 197
Joined: Sat Aug 28, 2021 2:31 pm
Location: tɑ tɑ θiθɾ eɾloθ tɑ moew θerts

Re: Does anyone have a local copy of zompist.com?

Post by WarpedWartWars »

Update 2 for today: it's much improved; non-URL strings like "blah" (nothing with no dot in it) are now ignored, and it reuses already-downloaded files where possible:

Code: Select all

import os
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from http.client import InvalidURL
import sys
import re

def istextlike(url):
    return any(url.lower().endswith(ext) for ext in
               (".txt",".md",".html",".htm",".shtml"))

def _load(url):
    try:
        return urlopen(url).read() #.decode(errors="backslashreplace")
    except UnicodeEncodeError:
        raise URLError("invalid char in url")

def load(url):
    c = _load(url)
    if istextlike(url):
        c = re.sub(r"\\x([0-9a-f]{2})",lambda m:chr(int(m[1],16)),
                   c.decode(errors='backslashreplace'))
    return c

def loadlocal(url, file):
    c = file.read()
    if istextlike(url):
        c = re.sub(r"\\x([0-9a-f]{2})",lambda m:chr(int(m[1],16)),
                   c.decode(errors='backslashreplace'))
    return c

def get_links(url, page):
    narrowed = []
    for match in re.findall(r"""(?:
                                 (?:href|src)=
                                 (?P<quote1>['"])
                                  (?P<url1>.*?)
                                 (?P=quote1)
                                )|
                                (?:
                                 (?:url|open)\(
                                  (?P<quote2>['"])
                                   (?P<url2>.*?)
                                  (?P=quote2)
                                 \)
                                )|
                                (?:
                                 name\s*=\s*
                                 (?P<quote3>['"])
                                  (?P<url3>.*?)
                                 (?P=quote3)
                                )""",
                                   page, re.IGNORECASE | re.VERBOSE):
        link = match[1] or match[3] or match[5]
        if ("#" in link or
            "?" in link):
            continue
        relurl = nondomain(url).split("/")[:-1]
        curr = ["http://"+domain(url)]
        while link.startswith("."):
            relurl = relurl[:-1]
            link = "/".join(link.split("/")[1:])
        curr += relurl
        if link.startswith("http"):
            if domain(link) == domain(url):
                curr = [curr[0]] + nondomain(link).split("/")
            else:
                continue
        elif "." not in link:
            continue
        elif ("://" in link or
              link.startswith("mailto:")):
            continue
        elif "/" in link:
            if link.startswith("/"):
                curr = [curr[0]]
                link = link[1:]
            curr += link.split("/")
        else:
            curr += [link]
        narrowed.append("/".join(curr))
    return narrowed

def save_file(path, file):
    path = sanitizefilename(path)
    print("saving '" + path + "'...")
    if "/" in path:
        os.makedirs("/".join(path.split("/")[:-1]), exist_ok=True)
    try:
        with (open(path, "wb")) as f:
              #if isinstance(file, bytes)
              #else open(path, "w", errors="ignore")) as f:
            f.write(file if isinstance(file, bytes)
                    else file.encode())
    except PermissionError:
        pass

def sanitizefilename(path):
    path = path.strip()
    if path[-1] == "/":
        path += "index.html"
    return path.replace(":", "_")

def dhelp(url):
    return (url.lstrip("qwertyuiopasdfghjklzxcvbnm").lstrip(":/")
                if "://" in url else #"./"+
                                     url).split("/")

def domain(url):
    return dhelp(url)[0]

def nondomain(url):
    return ("/".join(dhelp(url)[1:]) if len(dhelp(url)) else "")

def _main(todo, done):
    url = todo.pop(0)
    if url in done:
        return
    done.append(url)
    print(url)
    if os.path.exists(sanitizefilename(nondomain(url))):
        p = open(sanitizefilename(nondomain(url)), "rb")
        page = loadlocal(sanitizefilename(nondomain(url)), p)
    else:
        try:
            page = load(url)
        except HTTPError as err:
            if err.status//100==4:
                print(str(err.status) + ": " + url, file=sys.stderr)
                return
            raise
        except URLError as err:
            if True:#err.errno == -3:
                print("Failed with URLError: " + str(err.reason), file=sys.stderr)
            return
        except InvalidURL as err:
            print(err, file=sys.stderr)
            return
        if ((isinstance(page, str) and "</address>" not in page)
            or isinstance(page, bytes)):
            save_file(nondomain(url), page)
    if isinstance(page, bytes):
        return
    for link in get_links(url, page):
        if nondomain(link):
            todo.append(link)#_main(link)

def main(url):
    done = []
    todo = [url]
    os.makedirs(sanitizefilename(domain(url)), exist_ok=True)
    os.chdir(sanitizefilename(domain(url)))
    try:
        while len(todo)>0:
            _main(todo, done) #(url + "/index.html") if not nondomain(url) else url)
    finally:
        os.chdir("../")
tɑ tɑ tɑ tɑ θiθɾ eɾloθ tɑ moew θerts olɑrk siθe
of of of of death abyss of moew kingdom sand witch-PLURAL
The witches of the desert of the kingdom of Moew of the Abyss of Death

tɑ toɾose koɾot tsɑx
of apple-PLURAL magic cold
cold magic of apples
Torco
Posts: 797
Joined: Fri Jul 13, 2018 9:11 am

Re: Does anyone have a local copy of zompist.com?

Post by Torco »

may i suggest beautifulsoup? manually parsing html is possible, i've done it, but it's almost always simpler to just have someone else's code do it for you.
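
for example, pulling the links out of a page is only a few lines with BeautifulSoup (third-party: pip install beautifulsoup4); the scrap of HTML here is just a stand-in:

Code: Select all

# Link extraction with BeautifulSoup instead of a hand-written regex.
# Requires the third-party package: pip install beautifulsoup4
from bs4 import BeautifulSoup

html = '<a href="mars/index.html">Virtual Verduria</a> <img src="illo/zbblogo.gif">'
soup = BeautifulSoup(html, "html.parser")

links = [tag.get("href") or tag.get("src")
         for tag in soup.find_all(["a", "img"])]
print(links)   # ['mars/index.html', 'illo/zbblogo.gif']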
User avatar
xxx
Posts: 811
Joined: Sun Jul 29, 2018 12:40 pm

Re: Does anyone have a local copy of zompist.com?

Post by xxx »

why bother...
to make my conlang more memorable and lively, I tend not to archive anything...
reinventing the wheel every day makes it possible to have a conlang as a second language,
even if it's a particular language that you don't speak much and wouldn't understand orally...