Description:
There are some fun exercise done, and those are Python.
Table Of Content:
[TOC]
1. ImgToChar
Convert the image pixel information to char.
├── imgToChar.py
├── result
└── test
└── dora.png
There are two directors, one is result
that stores the result, the other is test
that stores the test image file.
Run the python script file imgToChar.py
directly, like python imgToChar.py -f test/dora.png --width 80 --height 80
.
Crawl the user's repo information, like repo name and latest update time. Run the crawl project, likes:
cd CrawlWeb
scrapy crawl github
Besides, there is some interactive information.
There are some packages required:
- scrapy
- time
- pandas
3. movieCrawl
Crawl the DouBan Top250 movie, you can run the script, like python douban.py
.
4. loggerConfig
Use configuration file to setup logger. There are some default FORMATTERS and HANDLERS configuration conf
file that can be parsed with HOCON .
There are some packages required:
- pyhocon
There is a example:
import logging
import logging.config
conf = create_logger_conf(log_path="log.log")
logger_conf = {
"loggerss": {
"sample": {
"handlers": ["console", "file"],
"level": "INFO"
}
}
}
conf.update(logger_conf)
logging.config.dictConfig(conf)
# now can set logger `sample`
sample = logging.getLogger("sample")
sample.info("This is test message")
It’s a customize script(visualize) to display pytorch training progress, which is used the package visdom
.
Caution: it does plot line.
TODO:
- Line plot
- Single Line plot
- Multi-line plot
- Text plot
- Image
- Video
Right now, it can plot one single line, multi-line on one figure, multi-line on different figure. When plot train and validate result, the train value is forehead.
# import package
from utils import visualize
import numpy as np
# Initialization
vis = visualize.Visualizer2(report_format="{time}:{epoch}")
# plot train and validate result
for _ in range(4):
data = {
"accuracy": [np.random.randint(-20, 3000), np.random.randint(-20, 3000)],
"f1": [np.random.randint(-200, 150), np.random.randint(-200, 230)]
}
vis.multi_plot(data)
# plot single result with plot method
for _ in range(4):
data = np.random.randint(-5, 10)
vis.plot(data, "accuracy")
# plot single result with multi_plot method
for _ in range(4):
data = {"accuracy": np.random.randint(-5, 10)}
vis.multi_plot(data)
# plot more result with multi_plot method, like accuracy and f1
for _ in range(4):
data = {
"accuracy": np.random.randint(-5, 5),
"f1": np.random.randint(-5, 4)
}
vis.multi_plot(data)
Multi-threading example, there are two thread: one is used to extract data item ExtractorThread
, the other is predict sentiment item PredictThread
. The thread class example:
class NamedThread(threading.Thread):
def __init__(self, filename, mode, **kwargs):
"""Initalize Property"""
self.filename = filename
self.mode = mode
super().__init__(**kwargs)
def deal_method(self):
"""Deal With Data Item Method"""
global ITEM_QUEUE, RETRY_TIME
retry = 0 # retry times
logger.info(f"Start predict thread")
while retry <= RETRY_TIME:
try:
item = ITEM_QUEUE.get(timeout=3)
text = item["clean_text"]
tokens = model.prepross(text)
item["predict"] = model.predict(tokens) if len(tokens) else (1, 0, 0, 0)
# write data
fhandler = FileWriter(self.filename, mode=self.mode)
with fhandler as writer:
writer.write(json.dumps(item, ensure_ascii=False) + "\n")
logger.debug(f"Predict item id is {item['id']}")
# if get item right, re_initial retry
retry = 0
except queue.Empty:
logger.debug(f"Queue is empty, wait 1 seconds.")
retry -= 1
time.sleep(1)
# exhausted RETRY_TIME
logger.info(f"Data item is exhausted")
def run(self):
"""Overide Method From Thread"""
self.predict()
5. RegexNLP
Compress word information about Weibo. Use regular expression to sub unified token, like:
BOS = "<BOS/>"
EOS = "</EOS>"
UNK = "<UNK>"
DATE = "#DATE"
TIME = "#TIME"
NUMBER = "#NUM"
LINK = "#URL"
IMAG = "#IMG"
USER = "#USER"
FEE = "#FEE"
PHONE = "#PHONE"
Caution
There are two functions to deal with string:
# string to unicode
>>> string = "这个是一个测试字符串"
>>> text2unicode(string) # 默认情况是直接得到 16 进制字符串
'\u8FD9\u4E2A\u662F\u4E00\u4E2A\u6D4B\u8BD5\u5B57\u7B26\u4E32'
>>> print("\u8FD9\u4E2A\u662F\u4E00\u4E2A\u6D4B\u8BD5\u5B57\u7B26\u4E32")
这个是一个测试字符串
>>> print(text2unicode(string), True) # 转换为字节型字符
b'\\u8fd9\\u4e2a\\u662f\\u4e00\\u4e2a\\u6d4b\\u8bd5\\u5b57\\u7b26\\u4e32'
# unicode to string
>>> string = b'\\u8fd9\\u4e2a\\u662f\\u4e00\\u4e2a\\u6d4b\\u8bd5\\u5b57\\u7b26\\u4e32'
>>> unicode2text(string) # 字节型数据转换
'这个是一个测试字符串'
>>> string = '\\u8fd9\\u4e2a\\u662f\\u4e00\\u4e2a\\u6d4b\\u8bd5\\u5b57\\u7b26\\u4e32'
>>> unicode2text(string) # 字符串型数据转换
'这个是一个测试字符串'
7. Naive RNN
使用 Numpy 实现简单的 RNN cell,包含了三个属性: weight_io
是输入到输出权重 weight_ho
是上一个输出
8. ConfigParse
Python 的配置解析,可以分为两种情况:1. 解决 CMD 的方式交互的配置解析,这种主要是在命令行的情况下使用;2. 解决从文件中解析的情况,这种情况主要的目的是从一个配置文件中去解析。
第一种情况下可以使用标准库中的 argparse 解决,第二种的话可以根据情况判断是否需要使用第三方库,第三方库包括 pyhocon(是由谷歌开发完成,集成了多个类型的配置方式,yaml、init、json 等多种类型)。如果不使用第三方库,可以使用标准库中的 configparser 解决。
configparser 解析的文件是与 Microsoft Windows INI 文件的类似:
[DEFAULT]
ServerAliveInterval = 45
Compression = yes
CompressionLevel = 9
ForwardX11 = yes
[bitbucket.org]
User = hg
[topsecret.server.com]
Port = 50022
ForwardX11 = no
configparser
可以直接解析文件,通过一个 parser
对象可以直接使用 read
等几种方法读取文件或者数据可以达到解析数据的情况。在文件 parse.py
是通过 read
方法读取文件,脚本的数据方式是:
>>> from parse import Config
>>> config = Config("./database.ini")
# 获取 mysql 的配置信息
>>> config.mysql
{'host': '192.168.1.31',
'port': '3306',
'user': 'root',
'charset': 'utf8mb4',
'connect_timeout': '35',
'database': 'test'}
机器学习部分的有趣代码整理和总结
简单线性回归参数变化,方差和残差变化可视化。详情参考 README
10. AutoEncoder 图像应用
通过 Keras 搭建 encoder 和 decoder 模型的方式,将 encoder 的结果作为特征表现的方式