Download atricles from telnet://hulabear.twbbs.org
Python 2.7
python run.py -a [hula account] -p [hula password] -b [board] -s [starting article number] -e [ending article number]
For example, if you want to download No.2 ~ No.5 articles in board Cs11 (note that the board name is case-sensitive):
python run.py -a account -p password -b Cs11 -s 2 -e 5
Or if you only want to download article No.30:
python run.py -a account -p password -b Cs11 -s 30 -e 30
A Folder download_[board]
will be automatically created under the root folder.
For example, if you download articles from Cs11, the articles will be in:
hulabear_downloader\download_Cs11
In config.ini, there are some configs you can try:
[host]
timeout = 6
[data]
page_splitter =
[encode]
file_name = big5 | utf8
- If your connection to hulabear is too slow, you can increase the
timeout
limit. - If you want to know the range of a page within an article, you can change
page_splitter
to--
- The file name can be encoded to either big5 or utf-8.
- Save each article as a single text file.
- Remove BBS color code.
- Auto-skiping board opening page.
- Skip inaccessible F(friend only) or L(locked) articles.
- Delete duplicated account login.
Waiting for your pull request to fix these issues :)
- Do not support most of the BBS control code.
- Lines hit the end of the page will appear twice. (because hulabear copies the ending line of a page to the next page as a begining line)
- Unexpected spaces in article (due to BBS control code).
The reformat part I reference the craler by geniusturtle