touhf / build-your-own-dataset-generator

Tutorial on building dataset generator using Python programming language

Introduction

The goal of this tutorial is to teach you how to build your own CLI (command-line interface) dataset generator in Python.

The program will generate datasets as CSV files (text files with values separated by commas, saved with the .csv extension). You can add support for other formats later as an exercise.

If you are impatient or prefer to study the code on your own rather than follow a tutorial, here is the link to the source code - https://github.com/touhf/Dataset_Generator It's only about 400 lines of code.

Creating virtual environment

This program needs a few modules installed with pip, so it is recommended to create a virtual environment first.

To create a virtual environment, type the following command in your terminal.

$ python -m venv /path/to/new/virtual/environment

Then activate it (this is the command for Linux and macOS shells).

$ source /path/to/new/virtual/environment/bin/activate

Remember to activate the virtual environment before installing pip modules for this project, otherwise they will be installed globally.

The same applies to running the program: if the modules were installed into the virtual environment, activate it first, otherwise Python will look for the modules globally and raise an error because it cannot find them.

If you want to read more about virtual environments in Python, here is the link to the official documentation - https://docs.python.org/3/library/venv.html
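If you are unsure whether the environment is active, the interpreter can be asked directly. This is a minimal sketch; the helper name in_virtualenv is made up for this illustration and is not part of the tutorial's code.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when Python runs inside a venv-style environment."""
    # Inside a virtual environment sys.prefix points at the environment,
    # while sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```

Run it with the same "python" command you use for the project; if it prints False, the modules would be installed globally.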

Preparations

Programs usually consist of data structures, predefined values, functions that operate on them, and a main part that calls those functions: it reads input from the user, transforms it, generates new data and produces the desired output.

We will start by preparing the data structures, then create helper functions for data generation, and finally move on to the part where everything works together.

Create a Python script named "main.py"; this is where the main code will live.

Our dataset will consist of a table header and rows of data, so let's start by creating those.

class Header:
    columns = []  # list of Column objects

    def __init__(self):
        pass

    # "Column" is quoted because the Column class is defined further down
    def add_column(self, col: "Column"):
        self.columns.append(col)

    def remove_column(self, index: int):
        del self.columns[index]

columns is a list of Column objects, each holding the name of a column and the type of data to generate for it. We will create the Column class a bit later.

So far Header only has methods to add a Column object and to remove one. Since this is an interactive generator, the user will enter commands while the program is running, such as "add new column" (followed by its name and type) or "remove column" by id.

After specifying all Header columns, the user enters a command to generate data. The program asks for a file name and the number of rows, then writes a file with that name and the ".csv" extension, filled with randomly generated data of the types the user chose.
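One detail of Header is worth seeing in isolation: columns is declared on the class, not inside __init__, so every Header object shares the same list and the list is also reachable as Header.columns. The sketch below uses a reduced Header with plain strings instead of Column objects, purely for illustration.

```python
class Header:
    columns = []  # class attribute: one list shared by every Header object

    def add_column(self, col):
        self.columns.append(col)  # mutates the shared class-level list


first = Header()
second = Header()
first.add_column("age")

print(second.columns)  # the column added through `first` is visible here
print(Header.columns)  # and through the class itself
```

This works for the tutorial because the program only ever needs one header, but it would be surprising if you wanted two independent headers.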

class DataRow:
    def __init__(self):
        # values of the current row; each DataRow object gets its own list
        self.cells = []

A DataRow object is used to generate one row of data and write it to the CSV file; then its cells are cleared and the next row is generated. Rows will look like this:

age, name, city,
20, Sharon Evans, Helsinki,
49, Mary Carey, Buluan,
47, Deanna Guerra, Dhampur,

This is an example of data generated by the program. The first column is named "age" and has the RANDOM_NUMBER type with borders 0 and 100, the second column "name" has the NAME type, and the third is a CITY from any country.
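Note that every row in this output ends with a comma followed by a space. As a rough illustration of what that means for whoever reads the file, the sketch below parses two such lines with the standard csv module; the trailing comma shows up as one extra empty field, which is easy to drop.

```python
import csv
import io

sample = "age, name, city,\n20, Sharon Evans, Helsinki,\n"

# skipinitialspace=True ignores the space that follows each comma.
rows = list(csv.reader(io.StringIO(sample), skipinitialspace=True))

# Drop the empty field produced by the trailing comma.
cleaned = [row[:-1] if row and row[-1] == "" else row for row in rows]

for row in cleaned:
    print(row)
```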

Let's create Column class now.

class Column:
    def __init__(self, name, col_type):
        self.name = name
        self.col_type = col_type

    def __str__(self):
        return "{0}, ".format(self.name)

The function that generates data will check the col_type of each Column object in the header and generate data accordingly.

Now we will add all column types for the program.

from enum import Enum
# available column types for generation
class ColumnType(Enum):
    NAME          = 1
    PHONE_NUMBER  = 2
    EMAIL         = 3
    RANDOM_NUMBER = 4
    CITY          = 5
    COUNTRY       = 6
    PASSWORD      = 7

Some column types need additional input to specify their parameters, for example the lower and upper borders for RANDOM_NUMBER, or the country for CITY.

After finishing this tutorial, feel free to add your own functionality as an exercise.

# print all column types
def list_column_types():
    print("Available column types:")
    for col_type in ColumnType:
        print("\t{0}) {1}".format(col_type.value, col_type.name))

This function displays all available data types to the user.
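The rest of the program relies on two kinds of Enum lookups: by value (the id the user types) and by name (the string stored in a column). The sketch below repeats the ColumnType enum and adds a hypothetical helper, type_name_from_input, only to show how those lookups behave.

```python
from enum import Enum


class ColumnType(Enum):
    NAME = 1
    PHONE_NUMBER = 2
    EMAIL = 3
    RANDOM_NUMBER = 4
    CITY = 5
    COUNTRY = 6
    PASSWORD = 7


def type_name_from_input(text):
    """Hypothetical helper: map a typed id to a type name, or None."""
    try:
        return ColumnType(int(text)).name
    except ValueError:  # raised for non-numeric text and for unknown ids
        return None


print(type_name_from_input("4"))  # lookup by value
print(ColumnType["CITY"].value)   # lookup by name
print(type_name_from_input("9"))  # no member has the value 9
```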

As mentioned, two of these data types need additional parameters during generation: RANDOM_NUMBER needs lower and upper borders, and CITY needs a country.

To avoid extra if-statements in the generator function just for those two types, we add two attributes to every Column object.

class Column:
    def __init__(self, name, col_type):

        #...

        # borders for RANDOM_NUMBER
        # lo also used for specifying country for CITY type
        self.lo = 0
        self.hi = 0

Add a __str__ method to the DataRow class so that whole rows of data can be appended to the CSV file during generation.

class DataRow:

    # ...

    def __str__(self):
        str_row = ""

        for cell in self.cells:
            str_row += "{0}, ".format(cell)
        str_row += "\n"

        return str_row

Now let's create the functions that generate random data. Create a "generators.py" file so the main file does not grow too big.

To generate random human names we will use the "names" module. Activate the virtual environment you created earlier if it is not active yet, install the module with "pip install names" in the terminal, then import it in "generators.py".

import names

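Add a function that generates a random name. Every generator takes two arguments, even if it ignores them, so that all generators can be called the same way later.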

def get_random_name(arg1, arg2):
    return names.get_full_name()

To generate random numbers we will import "random". You don't have to install it with pip because it is part of Python's standard library.

import random

Create a function that generates a random phone number.

def get_random_phone_number(arg1, arg2):
    first = str(random.randint(100, 999))
    second = str(random.randint(1, 888)).zfill(3)

    last = str(random.randint(1, 9998)).zfill(4)
    while last in ["1111", "2222", "3333", "4444", "5555", "6666", "7777", "8888"]:
        last = str(random.randint(1, 9998)).zfill(4)

    return "{}-{}-{}".format(first, second, last)

The zfill method pads a string with zeroes on the left until its length equals the argument passed to zfill.

For better understanding, here are some phone numbers generated by get_random_phone_number:

723-432-1925
143-003-3234
357-794-5677
712-437-5587
126-061-8502
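The zero padding that produces parts such as "003" and "061" above can be checked on its own. This is only a small demonstration of str.zfill with arbitrarily chosen numbers.

```python
# Numbers shorter than three digits are padded with leading zeroes.
padded = [str(number).zfill(3) for number in (3, 61, 432)]
print(padded)

# A string that is already long enough is returned unchanged.
unchanged = "12345".zfill(3)
print(unchanged)
```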

To generate a random email address we need to import "string". It is part of the standard library too, so there is no need to install it.

import string

Create a function that generates a random email.

def get_random_email(arg1, arg2):
    name = "".join(
        random.choice(string.ascii_letters) for i in range(random.randint(6, 12))
    )
    return name + "@gmail.com"

Here we pick between 6 and 12 random ASCII letters, so the addresses are not all the same length.

The function for generating a random number is straightforward, and the lower and upper borders make it versatile. You can, for example, set the borders to 1 and 100 and name the column "age", or use much larger borders to generate country populations or whatever data you need.

def get_random_number(low, high):
    return random.randint(low, high)
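Two properties of random.randint matter for this generator: both borders are included in the possible results, and a lower border greater than the upper one is rejected with a ValueError. The short experiment below shows both; the seed is fixed only so the demonstration is repeatable.

```python
import random

random.seed(0)  # fixed seed only to make this demonstration repeatable

samples = [random.randint(1, 3) for _ in range(1000)]
print(sorted(set(samples)))  # both borders appear among the values

try:
    random.randint(5, 1)  # lower border greater than upper border
    rejected = False
except ValueError as error:
    rejected = True
    print("rejected:", error)
```

This is why the program checks that the lowest number is not greater than the highest one before storing the borders.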

For random city and country names we need the "rcoc" pip module. Install it with "pip install rcoc", then import it in the generators script.

import rcoc

Add a function that generates a random city name.

def get_random_city(arg1, arg2):
    res = rcoc.get_random_city_by_country(arg1)
    if res.strip() == "":
        return rcoc.get_random_city()
    else:
        return res

The user can select the country from which a random city is picked. rcoc.get_random_city_by_country(arg1) returns an empty string if the country wasn't found, so in that case we pick a city from any country rather than leave the cell empty.
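The fallback logic can be tried without installing rcoc. In the sketch below the CITIES dictionary and the city_by_country function are made-up stand-ins for the library, so only the "empty result means any country" pattern is the same as in the tutorial.

```python
import random

# Made-up sample data standing in for the rcoc module.
CITIES = {"Finland": ["Helsinki", "Tampere"], "India": ["Dhampur"]}
ALL_CITIES = [city for group in CITIES.values() for city in group]


def city_by_country(country):
    """Stand-in lookup: returns "" when the country is unknown."""
    cities = CITIES.get(country)
    return random.choice(cities) if cities else ""


def get_random_city(arg1, arg2=None):
    res = city_by_country(arg1)
    if res.strip() == "":
        # Unknown country (for example "any"): fall back to any city.
        return random.choice(ALL_CITIES)
    return res


print(get_random_city("Finland"))
print(get_random_city("any"))
```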

Then add a function that generates a random country.

def get_random_country(arg1, arg2):
    return rcoc.get_random_country()

Add another function that is useful for the interactive interface: the user can ask for the list of all countries available for generation, and this function provides it.

def get_all_countries():
    return rcoc.get_all_countries()

It returns all countries in an array.

Now let's add a function that generates a random password.

def get_random_password(arg1, arg2):
    characters = string.ascii_letters + string.digits
    password = "".join(random.choice(characters) for i in range(random.randint(8, 16)))
    return password

Here we pick random ASCII letters and digits, between 8 and 16 characters in total, so that the passwords are not all the same length.
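The random module is fine for filling a test dataset, but it is not designed for security purposes. If the generated passwords were ever meant to be used for real, the standard secrets module would be the safer choice. The function below is a sketch under that assumption; get_secure_password is a hypothetical name, not part of the tutorial's code.

```python
import secrets
import string


def get_secure_password(min_len=8, max_len=16):
    """Hypothetical variant of get_random_password built on secrets."""
    characters = string.ascii_letters + string.digits
    # randbelow(n) returns an int in [0, n), so the length is min_len..max_len.
    length = min_len + secrets.randbelow(max_len - min_len + 1)
    return "".join(secrets.choice(characters) for _ in range(length))


print(get_secure_password())
```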

Now that all generator functions are written, let's go back to "main.py".

First of all, we need to import our generators into the main script.

import generators

Then create a list of all generator functions.

generator_functions = [
    generators.get_random_name,
    generators.get_random_phone_number,
    generators.get_random_email,
    generators.get_random_number,
    generators.get_random_city,
    generators.get_random_country,
    generators.get_random_password,
]

When the user adds a new column and chooses its data type, the program lists all available data types with their numbers (1-7) and asks for the id of the desired type. During generation, the function at generator_functions[id-1] in this list is called.

So it is important that the functions appear in the list in exactly this order, matching the values in ColumnType.

Remember that list indexes start at 0, which is why we use "id-1" to choose the function.
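If keeping the list order in sync with the enum values feels fragile, one possible alternative is a dictionary keyed by ColumnType. This is not how the tutorial's program does it; the sketch uses a reduced enum and placeholder generators (fake_name, GENERATORS and generate_cell are made-up names) just to show the idea.

```python
import random
from enum import Enum


class ColumnType(Enum):
    NAME = 1
    RANDOM_NUMBER = 4


def fake_name(arg1, arg2):
    return "Jane Doe"  # placeholder instead of the names module


def random_number(low, high):
    return random.randint(low, high)


# Each type is bound to its generator explicitly, so order no longer matters.
GENERATORS = {
    ColumnType.NAME: fake_name,
    ColumnType.RANDOM_NUMBER: random_number,
}


def generate_cell(type_name, lo, hi):
    return GENERATORS[ColumnType[type_name]](lo, hi)


print(generate_cell("NAME", 0, 0))
print(generate_cell("RANDOM_NUMBER", 1, 5))
```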

Finally, let's add the generate method to the DataRow class.

class DataRow:

    # ...

    def generate(self):
        self.cells.clear()
        for col in Header.columns:
            self.cells.append(
                generator_functions[ColumnType[col.col_type].value - 1](col.lo, col.hi)
            )

As explained earlier, this method generates one row of data. The row is then written to the dataset file (using the __str__ method of DataRow), and the cells are cleared before the next row is generated.

show_header function

For the user's convenience, let's add a method to the Header class that shows all columns defined so far. It lists each column with its id, type and name.

# print all values in columns list
def show_header(self):
    print("+--------+--------------+--------+")
    print("|   id   |     type     |  name  |")
    print("+--------+--------------+--------+")
    for i in range(len(self.columns)):
        # printing with index in list

First, it prints a header for the table.

We also want to display the additional parameters of the RANDOM_NUMBER and CITY data types, so we add if-statements for them.

Add the following code inside the "for" loop.

if self.columns[i].col_type == ColumnType.RANDOM_NUMBER.name:
    print(
        "[{0}]\t{1} ({3}-{4})\t{2}".format(
            i,
            self.columns[i].col_type,
            self.columns[i].name,
            self.columns[i].lo,
            self.columns[i].hi,
        )
    )

This displays the RANDOM_NUMBER type. Now let's add the code that displays the CITY type.

elif self.columns[i].col_type == ColumnType.CITY.name:
    print(
        "[{0}]\t{1} ({3})\t{2}".format(
            i,
            self.columns[i].col_type,
            self.columns[i].name,
            self.columns[i].lo,
        )
    )

And finally, the code for all other types, which have no additional parameters.

else:
    print(
        "[{0}]\t{1}\t{2}".format(
            i, self.columns[i].col_type, self.columns[i].name
        )
    )
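The numbers in the format strings above refer to argument positions, which is why the name (position 2) can be printed last even though it is passed third. A quick check with example values, plus the same line written as an f-string for comparison:

```python
template = "[{0}]\t{1} ({3}-{4})\t{2}"
line = template.format(0, "RANDOM_NUMBER", "age", 0, 100)
print(repr(line))

# The same line as an f-string, where the output order is visible directly.
i, col_type, name, lo, hi = 0, "RANDOM_NUMBER", "age", 0, 100
same_line = f"[{i}]\t{col_type} ({lo}-{hi})\t{name}"
print(same_line == line)
```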

At the end of the function, call "print()" with no arguments to add an empty line.

print()

Here is the full code of show_header function:

# print all values in columns list
def show_header(self):
    print("+--------+--------------+--------+")
    print("|   id   |     type     |  name  |")
    print("+--------+--------------+--------+")
    for i in range(len(self.columns)):
        # printing with index in list
        if self.columns[i].col_type == ColumnType.RANDOM_NUMBER.name:
            print(
                "[{0}]\t{1} ({3}-{4})\t{2}".format(
                    i,
                    self.columns[i].col_type,
                    self.columns[i].name,
                    self.columns[i].lo,
                    self.columns[i].hi,
                )
            )
        elif self.columns[i].col_type == ColumnType.CITY.name:
            print(
                "[{0}]\t{1} ({3})\t{2}".format(
                    i,
                    self.columns[i].col_type,
                    self.columns[i].name,
                    self.columns[i].lo,
                )
            )
        else:
            print(
                "[{0}]\t{1}\t{2}".format(
                    i, self.columns[i].col_type, self.columns[i].name
                )
            )
    print()

Changing name and type of columns

Now we will add methods for changing a column's name and type to the Header class.

def change_column_name(self, index: int, new_name: str):
    self.columns[index].name = new_name

Changing the name is simple. Now let's write the method that changes the type.

def change_column_type(self, index: int, new_type: str):
    # new_type is the name of a ColumnType member, for example "CITY"
    # if new column type is CITY
    if new_type == ColumnType.CITY.name:
        print("enter 'countries' to see list of all countries.")
        lo = input("(specify country/leave empty if any)~> ").strip().title()
        while lo.lower() == "countries":
            print(generators.get_all_countries())
            # normalize the repeated input the same way as the first one
            lo = input("(specify country/leave empty if any)~> ").strip().title()

        # if country not found in list of countries write warning and exit
        if lo != "" and lo not in generators.get_all_countries():
            print("Country not found.")
            return

        # if no input then any country
        if lo == "":
            lo = "any"

        self.columns[index].col_type = new_type
        self.columns[index].lo = lo

This method takes the new column type as a parameter. If the new type is CITY, the user can choose a country or leave the field empty if it doesn't matter.
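One thing to be aware of with this kind of normalization: str.title() capitalizes every word, so a name that contains lowercase words would no longer match after title(). Whether the country list used here contains such names is not something this tutorial states, so treat the following as a sketch of a case-insensitive alternative; match_country and the sample list are made up for illustration.

```python
def match_country(user_text, countries):
    """Hypothetical helper: case-insensitive lookup in a list of names."""
    wanted = user_text.strip().casefold()
    if wanted == "":
        return "any"  # empty input means any country
    for country in countries:
        if country.casefold() == wanted:
            return country  # return the spelling used in the list
    return None  # not found


sample_countries = ["Finland", "India", "Isle of Man"]  # made-up sample list

print("isle of man".title())  # every word gets capitalized
print(match_country("  isle of man ", sample_countries))
print(match_country("", sample_countries))
print(match_country("Atlantis", sample_countries))
```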

That's not the end of the method. Now let's add the branch for the RANDOM_NUMBER type.

    # if new column type is RANDOM_NUMBER
    elif new_type == ColumnType.RANDOM_NUMBER.name:
        try:
            lo = int(input("(lowest number)~> "))
            hi = int(input("(highest number)~> "))

            if lo > hi:
                print("Incorrect input. First number must be lower than second.")
            else:
                self.columns[index].col_type = new_type
                self.columns[index].lo = lo
                self.columns[index].hi = hi

        except ValueError:
            print("Incorrect number.")

    # all other types need no additional parameters
    else:
        self.columns[index].col_type = new_type

It asks the user for the lower and upper borders. The method is now finished.
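The border validation can also be written as a small function that is easy to test without typing anything into a prompt. This is only a sketch; parse_bounds is a hypothetical helper and the program in this tutorial does the same checks inline.

```python
def parse_bounds(lo_text, hi_text):
    """Hypothetical helper: turn two input strings into valid borders."""
    lo = int(lo_text)  # raises ValueError for non-numeric text
    hi = int(hi_text)
    if lo > hi:
        raise ValueError("lowest number must not be greater than highest")
    return lo, hi


print(parse_bounds("1", "100"))

for bad_input in (("ten", "20"), ("50", "5")):
    try:
        parse_bounds(*bad_input)
    except ValueError as error:
        print("rejected", bad_input, "-", error)
```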

We need only one Header object for generating data. Add the following code at the end of the script.

header = Header()

Then add a function for clearing the header.

def clear_table():
    header.columns.clear()

At this point we have everything we need to generate a dataset, so let's start writing the main loop of the program.

Main loop

Create the main loop at the end of the script. First we declare all commands available in the program; their functionality is added later.

running = True
# introduction, waiting for commands
print("====== Dataset Generator ======\n")
print('Type "help" for information.')
while running:
    user_input = input(">>> ")
    # list of available commands
    if user_input.lower().strip() == "help":
        pass  # ...

    # show header
    elif user_input.lower().strip() == "h":
        pass  # ...

    # add new column
    elif user_input.lower().strip() == "n":
        pass  # ...

    # remove column by id
    elif user_input.lower().strip() == "r":
        pass  # ...

    # change column
    elif user_input.lower().strip() == "c":
        pass  # ...

    # generate dataset
    elif user_input.lower().strip() == "g":
        pass  # ...

    # clear header
    elif user_input.lower().strip() == "reset":
        pass  # ...

    # terminating on exit command
    elif user_input.lower().strip() == "exit":
        running = False

    # empty input, nothing to do
    elif user_input.strip() == "":
        pass

    else:
        print("Unknown command. Type 'help' for information.")

After the loop, at the very bottom of the script, add this line.

print("Program complete.")
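The if/elif chain is easy to follow, and this tutorial sticks with it. As a possible exercise, the same command handling could be expressed as a dictionary that maps command names to functions. The sketch below is a hypothetical alternative with placeholder handlers, not the program's actual structure.

```python
def handle_command(user_input, handlers):
    """Hypothetical dispatcher: look up a handler by normalized command."""
    command = user_input.lower().strip()
    if command == "":
        return None  # empty input, nothing to do
    handler = handlers.get(command)
    if handler is None:
        return "Unknown command. Type 'help' for information."
    return handler()


# Placeholder handlers that only return text for the demonstration.
handlers = {
    "help": lambda: "List of commands: h, n, r, c, g, reset, exit",
    "h": lambda: "(header would be shown here)",
}

print(handle_command("  HELP ", handlers))
print(handle_command("x", handlers))
```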

Now let's make the commands work.

help - show list of commands

# list of available commands
if user_input.lower().strip() == "help":
    print(
        "List of commands:\n"
        "\th - show created table\n"
        "\tn - add new column\n"
        "\tr - remove column\n"
        "\tc - change column\n"
        "\tg - generate dataset\n"
        "\treset - clear table template\n"
        "\texit - terminate app\n"
    )

h - show header

# show header
elif user_input.lower().strip() == "h":
    header.show_header()

n - add new column

# add new column
elif user_input.lower().strip() == "n":
    input_name = input("(column name)~> ").strip()

    # column name can't be empty
    if input_name.strip() == "":
        print("Column name can't be empty.")
        continue

Here we check that the column name is not empty.

    # specifying type of column
    try:
        list_column_types()
        input_type = input("(column type)(1-7)~> ")

        # ValueError is raised for non-numeric input and for unknown ids
        input_type = ColumnType(int(input_type)).name
    except ValueError:
        print("Incorrect data type. Enter type id from list of available options.")
        continue

    created_column = Column(input_name, input_type)

This part asks for the column type and creates the Column object.

    # if type is CITY ask user to enter country
    # if the field is left empty, cities are generated from any country
    if input_type == ColumnType.CITY.name:
        print("enter 'countries' to see list of all countries.")
        created_column.lo = (
            input("(specify country/leave empty if any)~> ").strip().title()
        )
        while created_column.lo.lower() == "countries":
            print(generators.get_all_countries())
            # normalize the repeated input the same way as the first one
            created_column.lo = (
                input("(specify country/leave empty if any)~> ").strip().title()
            )

        # if country not found in list of countries write warning and exit
        if (
            created_column.lo != ""
            and created_column.lo not in generators.get_all_countries()
        ):
            print("Country not found.")
            continue

        # if no input then any country
        if created_column.lo == "":
            created_column.lo = "any"

This is the code for the CITY type.

And finally, let's add the code for the RANDOM_NUMBER type.

    # if type is RANDOM_NUMBER ask user to enter borders
    if input_type == ColumnType.RANDOM_NUMBER.name:
        try:
            created_column.lo = int(input("(lowest number)~> "))
            created_column.hi = int(input("(highest number)~> "))

            if created_column.lo > created_column.hi:
                print("Incorrect input. First number must be lower than second.")
                continue
        except ValueError:
            print("Incorrect number.")
            continue

    header.add_column(created_column)

r - remove column

# remove column
elif user_input.lower().strip() == "r":
    # displaying all columns
    header.show_header()

    # getting id of column to delete
    user_input = input("(column id to delete)~> ")
    try:
        header.remove_column(int(user_input))
    except (ValueError, IndexError):  # not a number, or no such column
        print("Column with id {0} not found.".format(user_input))

This one is straightforward. The user specifies an id and the program removes the column with that id; if there is no such column, it prints an error message.
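One subtlety: Python lists accept negative indexes, so "del columns[-1]" silently removes the last column instead of failing. If you want ids to be strictly the ones shown in the table, an explicit range check helps. The function below is a standalone sketch with a hypothetical name, working on a plain list.

```python
def remove_column_checked(columns, raw_id):
    """Hypothetical helper: delete columns[raw_id] only for a valid id."""
    try:
        index = int(raw_id)
    except ValueError:
        return False  # not a number at all
    if not 0 <= index < len(columns):
        return False  # rejects negative and too-large ids
    del columns[index]
    return True


columns = ["age", "name", "city"]
print(remove_column_checked(columns, "1"), columns)
print(remove_column_checked(columns, "-1"), columns)
print(remove_column_checked(columns, "abc"), columns)
```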

c - change column

# change column
elif user_input.lower().strip() == "c":
    # displaying all columns
    header.show_header()

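    # getting id of column to change
    try:
        id_to_change = int(input("(column id to change)~> "))
        header.columns[id_to_change]  # raises IndexError for an unknown id
    except (ValueError, IndexError):
        print("Invalid id.")
        continue  # without this, id_to_change would be undefined below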

    what_to_change = input(
        "What'd you like to change?\n" "\t1) Name\n" "\t2) Type\n" "\t3) Both\n~> "
    )

Here we ask the user for the id of the column to change and what to change: the name, the type or both.

Below we add the code for each of these 3 answers.

Changing name.

    # changing only name
    if what_to_change.strip() == "1":
        new_name = input("(enter new name)~> ")
        header.change_column_name(id_to_change, new_name)

        print("Column name changed.")

Changing type.

    # changing only type
    elif what_to_change.strip() == "2":
        list_column_types()
        new_type = input("(column type)(1-7)~> ")

        try:
            new_type = ColumnType(int(new_type)).name
            header.change_column_type(id_to_change, new_type)
        except ValueError:
            print(
                "Incorrect data type. Enter type id from list of available options."
            )
            continue

Changing both name and type of column.

    # changing name and type
    elif what_to_change.strip() == "3":
        # changing name
        new_name = input("(enter new name)~> ")
        header.change_column_name(id_to_change, new_name)

        # changing type
        list_column_types()
        new_type = input("(column type)(1-7)~> ")
        try:
            new_type = ColumnType(int(new_type)).name
            header.change_column_type(id_to_change, new_type)
        except ValueError:
            print(
                "Incorrect data type. Enter type id from list of available options."
            )
            continue

    else:
        print("Unknown option.")

g - generate data

Finally, the most important part of the code: generating the data.

# generate dataset
elif user_input.lower().strip() == "g":
    file_name = input("(enter name of CSV file)~> ")

    try:
        amount_of_rows = int(input("(how many rows to generate?)~> "))
    except ValueError:
        print("Invalid value.")
        continue  # without this, amount_of_rows would be undefined below

    print("Generating data...")

The program asks for the name of the dataset file and how many rows of data to generate.

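    # writing header to file
    with open("{0}.csv".format(file_name), "a") as f:
        for column in header.columns:
            f.write("{0}, ".format(column.name))
        f.write("\n")

Here we open the CSV file (it is created if it does not exist yet) and write the header line. Because the file is opened in append mode, an existing file with the same name is extended rather than overwritten.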

    # generating rows of data, writing data to csv file
    with open("{0}.csv".format(file_name), "a") as f:
        for row_index in range(amount_of_rows):
            data_row = DataRow()
            data_row.generate()
            f.write(str(data_row))

Now we generate all the rows and append them to the file, which is closed afterwards.
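The two writing steps can be tried outside the interactive program. The sketch below is a simplified stand-in: write_dataset is a hypothetical function, the rows are fixed instead of random, and the file is opened once in "w" mode (which overwrites, unlike the append mode used in the tutorial) inside a temporary folder.

```python
import os
import tempfile


def write_dataset(path, column_names, rows):
    """Hypothetical helper: write a header line and data rows to path."""
    with open(path, "w") as f:  # "w" overwrites any existing file
        f.write("".join("{0}, ".format(name) for name in column_names) + "\n")
        for row in rows:
            f.write("".join("{0}, ".format(cell) for cell in row) + "\n")


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "people.csv")
    write_dataset(path, ["age", "name"], [[20, "Sharon Evans"], [49, "Mary Carey"]])
    with open(path) as f:
        content = f.read()

print(content)
```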

    print("Generation complete.")
    answer = input("Do you want to continue working? (y/n): ").strip()
    if answer.lower() == "y":
        answer = input("Clean last table template? (y/n): ").strip()
        if answer.lower() == "y":
            clear_table()

    elif answer.lower() == "n":
        running = False

Finally, when the dataset file is ready, we ask whether the user wants to continue working with the program and, if so, whether the header should be cleared. Keeping the header is useful for generating another dataset with the same columns but different values or a different number of rows.

reset

Add the last command, which clears the header.

# clearing table template
elif user_input.lower().strip() == "reset":
    clear_table()
    print("Template cleared.")

Conclusion

Congratulations! You now have your own dataset generator. For now it only produces CSV datasets, but you can add other output formats and any other functionality you want.

If something isn't working, you may have made a typo somewhere; compare your code with the source code - https://github.com/touhf/Dataset_Generator
