A series of data collecting & processing scripts for performing a set of analyses upon the various games of the NC lottery.
The written analysis is located elsewhere (yet to be formalized). Once complete, it will be linked here.
Many of the games we've analyzed feature a similar structure: the user picks a series of numbers, and the dealer picks a series of numbers; the prize is then dependent upon how many of the user's numbers match the dealer's. An important note: typically, the order of the numbers is irrelevant.
Using the aforesaid information, we developed a relatively clever system to quickly
analyze, and thereupon match, a given number string: encode each number into a n
-bit
integer, wherein each n
-th bit encodes the presence of a number of value n
.
For example, in some arbitrary game:
dealer_numbers = "1, 2, 3, 4"
user_numbers = "1, 2"
dealer_number_bits = 30; 0b11110
user_numbers_bits = 6; 0b110
It's trivial then to calculate how many numbers matched
(popcount
of the bitwise-AND of the
two numbers), and moreover, what numbers lie within a given string (useful for various
analyses done later).
Note that the above is an n
-bit integer: in Python, integers are of an arbitrary
precision, but in most parts of the universe, this isn't true. The solution? Break up
the integer into parts, each of a BIT_COUNT
size (most often, this is 64).
As it stands, cash345 focuses primarily on the collecting, and thereon processing, of data from NC's Cash 5. This includes a scraper for collecting the data, an a series of post-processing scripts for manipulating it down to a usable form.
Perhaps the most interesting script is back_test.py: this runs a "what-if" simulation of the following: what if you were to play the same number combination every day, starting from some arbitrary point in time (clamped between the start of NC's Cash 5 game and now)? How much would you stand to lose, or to win?
An example:
cash5_path = "cash345/data/cash5_winnings_1.csv"
cash5_df = pd.read_csv(cash5_path)
nums = "1, 2, 3, 4, 5"
date = "10/08/2007"
winnings = back_test(nums, cash5_df, date)
Which would project your winnings, playing only 1, 2, 3, 4, 5
, starting on
10/08/2007
.
Using information collected from the NC state lottery, herein we process and analyze a
group of ~5M wager records of NC's Keno variant. We use
our usual post-processing tricks here: convert the "user" numbers (wagers
), and the
"dealer" numbers (drawings
) into their integral representations, & c. & c.
After processing, we store the various facets of the data into a SQLite database (though the data medium is relatively inconsequential).
A user can select to play a given number string for up to 20 dealer drawings. In place
of having then 20 rows for each repeated number, the wager data encodes this into one
row with two values demarcating the start and end wager ids (allowing the wager to be
tied back to a unique purchase). This is certainly more efficient, but to make the prize
calculations easier, we explode out these rows: so if a given row has a range from
1-n
, we'd turn this into n
rows with a new wager_id
.
A rather substantive optimization can be made during the processing of the wager data:
as the order of the numbers is irrelevant, and only numbers within a limited range
([1, 80]
) can be played, the wager data for an individual user is probabilistically
similar, with the potential for the played numbers to be duplicated an arbitrarily large
number of times; the only inextricable values are then the date
, wager_id
, etc.
Therefore, the numbers played, and the integral versions thereof - which consume the majority of the row space - may be compressed using an intermediary foreign-key table. The actual numbers played are separated out, leaving only a integer pointer into the aforesaid table.