Python module implementing Stable Random Projections, as described in "Stable Random Projection: Lightweight, General-Purpose Dimensionality Reduction for Digitized Libraries," Cultural Analytics, Vol. 1, Issue 2 (October 4, 2018).
These create interchangeable, data-agnostic vectorized representations of text suitable for a variety of contexts. Unlike most vectorizations, they can represent texts in any language that uses space tokenization, or even non-linguistic content, since they contain no implicit language model beyond words themselves.
You may want to use them in concert with the pre-distributed Hathi SRP features described further here.
Requires Python 3.

```
pip install pysrp
```
Version 2.0 (July 2022) slightly changes the default tokenization algorithm! Previously it was `\w`; now it is `[\p{L}\p{Pc}\p{N}\p{M}]+`.
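The new pattern uses Unicode property classes, which the standard-library `re` module does not support (the third-party `regex` module does). As an illustration only, the sketch below approximates the new token class with stdlib `unicodedata` and shows one concrete difference: `\w` in Python's `re` does not match combining marks, so decomposed accented words get split under the old rule.

```python
import re
import unicodedata

def tokenize(text):
    """Approximate the v2.0 token class [\\p{L}\\p{Pc}\\p{N}\\p{M}]+ with
    stdlib unicodedata: letters, connector punctuation, numbers, marks."""
    tokens, current = [], []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat[0] in ("L", "N", "M") or cat == "Pc":
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

# A decomposed accent ('e' + U+0301 COMBINING ACUTE) shows the difference:
# Python's \w does not match combining marks, so the old rule splits the word.
word = "e\u0301clair"
re.findall(r"\w+", word)   # ['e', 'clair']
tokenize(word)             # ['e\u0301clair']
```

This is my own illustration of the behavioral change, not the library's tokenizer.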
I also no longer recommend use
See the docs folder for some IPython notebooks demonstrating:
- Taking a subset of the full Hathi collection (100,000 works of fiction) based on identifiers, and exploring the major clusters within fiction.
- Creating a new SRP representation of text files and plotting dimensionality reductions of them by language and time
- Searching for copies of one set of books in the full HathiTrust collection, and using Hathi metadata to identify duplicates and find errors in local item descriptions.
- Training a classifier based on library metadata using TensorFlow, and then applying that classification to other sorts of text.
Use the SRP class to build an object to perform transformations. SRP is implemented as a class, rather than a function, so that it can build a cache of previously seen words.
```python
import SRP

# Initialize with the desired number of dimensions.
hasher = SRP.SRP(640)
```
The most important method is `stable_transform`, which tokenizes the input and computes its SRP.

```python
hasher.stable_transform(words="foo bar bar")
```
If counts are already computed, word and count vectors can be passed separately.
```python
hasher.stable_transform(words=["foo", "bar"], counts=[1, 2])
```
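To make the transformation concrete, here is a minimal sketch of the underlying technique: each word is hashed to a stable stream of ±1 values, and the document vector is the weighted sum of those sign vectors. The hash function and log weighting below are my illustrative choices; the library's actual implementation differs in detail.

```python
import hashlib
from collections import Counter
from math import log1p

def _signs(word, dims):
    # Derive a stable stream of +/-1 values from repeated SHA-256 digests,
    # so the same word always maps to the same projection.
    out, block = [], 0
    while len(out) < dims:
        digest = hashlib.sha256(f"{word}_{block}".encode("utf-8")).digest()
        for byte in digest:
            for bit in range(8):
                out.append(1.0 if (byte >> bit) & 1 else -1.0)
        block += 1
    return out[:dims]

def srp_vector(words, counts=None, dims=16):
    # Accept either a token list (counts tallied here) or words + counts.
    if counts is None:
        tallied = Counter(words)
        words, counts = list(tallied), list(tallied.values())
    vec = [0.0] * dims
    for word, count in zip(words, counts):
        weight = log1p(count)  # dampen high-frequency words
        for i, sign in enumerate(_signs(word, dims)):
            vec[i] += weight * sign
    return vec
```

Because each word's projection depends only on its hash, the two calling conventions above (raw tokens vs. precomputed counts) yield identical vectors.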
SRP files are stored in a binary format to save space; it is the same format used by binary word2vec files.
**Deprecation notice:** this format is now deprecated; I recommend the Apache Arrow binary serialization format instead.
```python
file = SRP.Vector_file("hathivectors.bin")

for (key, vector) in file:
    # 'key' is a unique identifier for a document in a corpus;
    # 'vector' is a numpy.array of dtype '<f4' (little-endian float32).
    pass
```
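For orientation, the word2vec binary layout is roughly a plain-text header (`row_count dims`) followed by rows consisting of a name, a space, and `dims` little-endian float32 values. Below is a minimal round-trip sketch under those assumptions; it is not the library's code, and real files involve details (row terminators, encoding edge cases) omitted here.

```python
import io
import struct

def write_vectors(buf, rows, dims):
    # Header: "<n_rows> <dims>\n", then each row as name + space + raw floats.
    buf.write(f"{len(rows)} {dims}\n".encode("utf-8"))
    for name, vec in rows:
        buf.write(name.encode("utf-8") + b" ")
        buf.write(struct.pack(f"<{dims}f", *vec))

def read_vectors(buf):
    n_rows, dims = map(int, buf.readline().split())
    for _ in range(n_rows):
        # Read the name byte-by-byte until the separating space.
        name = bytearray()
        while (ch := buf.read(1)) != b" ":
            name += ch
        vec = struct.unpack(f"<{dims}f", buf.read(4 * dims))
        yield name.decode("utf-8"), list(vec)

# Round trip through an in-memory buffer:
buf = io.BytesIO()
write_vectors(buf, [("a", [1.0, 2.0]), ("b", [0.5, -1.0])], dims=2)
buf.seek(0)
rows = list(read_vectors(buf))  # [('a', [1.0, 2.0]), ('b', [0.5, -1.0])]
```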
There are two other methods. One lets you read an entire matrix in at once. This may require lots of memory. It returns a dictionary with two keys: 'matrix' (a numpy array) and 'names' (the row names).
```python
all = SRP.Vector_file("hathivectors.bin").to_matrix()
all['matrix'][:5]  # first five rows of the numpy array
all['names'][:5]   # the corresponding row names
```
The other lets you treat the file as a dictionary of keys. The first lookup may take a very long time; subsequent lookups will be fast without requiring you to load the vectors into memory. To get a 1-dimensional representation of a book:
```python
all = SRP.Vector_file("hathivectors.bin")
all['gri.ark:/13960/t3032jj3n']
```
You can also, thanks to Peter Organisciak, access multiple vectors at once by passing a list of identifiers. This returns a matrix: for two identifiers and a 160-dimensional representation, the shape is (2, 160).

```python
all[['gri.ark:/13960/t3032jj3n', 'hvd.1234532123']]
```
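The usual trick behind "slow first lookup, fast afterwards" is to scan the file once, record the byte offset of each key, and then seek directly on later lookups. Here is a generic sketch of that pattern over a toy tab-separated file; it is my illustration, not the library's implementation.

```python
import io

class OffsetIndex:
    """Index 'key<TAB>value' lines by byte offset so lookups can seek
    instead of holding every value in memory."""

    def __init__(self, fileobj):
        self.f = fileobj
        self.offsets = {}
        pos = self.f.tell()
        for line in iter(self.f.readline, b""):
            key = line.split(b"\t", 1)[0].decode("utf-8")
            self.offsets[key] = pos      # one pass: remember where each row starts
            pos = self.f.tell()

    def __getitem__(self, key):
        self.f.seek(self.offsets[key])   # jump straight to the row
        line = self.f.readline()
        return line.split(b"\t", 1)[1].rstrip(b"\n").decode("utf-8")

# Usage with an in-memory file standing in for a vector file:
f = io.BytesIO(b"gri.001\t0.1 0.2\nhvd.002\t0.3 0.4\n")
index = OffsetIndex(f)
index["hvd.002"]  # -> '0.3 0.4'
```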
You can build your own files row by row.
```python
# Note: the dimensions of the file and the hasher must be equal.
output = SRP.Vector_file("new_vectors.bin", dims=640, mode="w")
hasher = SRP.SRP(640)

for filename in [a, b, c, d]:
    hash = hasher.stable_transform(" ".join(open(filename).readlines()))
    output.add_row(filename, hash)

# Files must be closed when writing is finished.
output.close()
```
Since files must be closed, it can be easier to use a context manager:
```python
# Note: the dimensions of the file and the hasher must be equal.
hasher = SRP.SRP(640)

with SRP.Vector_file("new_vectors.bin", dims=640, mode="w") as output:
    for filename in [a, b, c, d]:
        hash = hasher.stable_transform(" ".join(open(filename).readlines()))
        output.add_row(filename, hash)
```