RoboticsClubIITJ / ML-DL-implementation

An implementation of ML and DL algorithms from scratch in Python using nothing but NumPy and Matplotlib.

Implement the "Numerical outlier method" to detect anomaly/outlier points in a dataset

Halix267 opened this issue · comments

@agrawalshubham01 Can I work on this?

@Halix267 Which method are you going to implement?

for detection of outliers

Kindly draft your API here.

Could you elaborate? I don't follow.

@Halix267 draft your model code here.

import random

import matplotlib.pyplot as plt


def dataset_developer(n, probability=(0.05, 0.05, 0.15, 0.5, 0.15)):
    # Draw n values from five bins of width 0.2 covering [0, 1).
    # Only the first four weights are used; the last bin gets whatever
    # probability is left over (0.25 with the defaults).
    X = []
    while len(X) < n:
        t = random.random()
        if t < probability[0]:
            X.append(random.random() * 0.2)
        elif t < probability[0] + probability[1]:
            X.append(random.random() * 0.2 + 0.2)
        elif t < probability[0] + probability[1] + probability[2]:
            X.append(random.random() * 0.2 + 0.2 * 2)
        elif t < probability[0] + probability[1] + probability[2] + probability[3]:
            X.append(random.random() * 0.2 + 0.2 * 3)
        else:
            X.append(random.random() * 0.2 + 0.2 * 4)
    return X

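def numeric_outlier(X, k):
    # X must already be sorted: the quartiles are read off by position.
    Q1 = X[int((len(X) + 1) / 4)]
    Q3 = X[int(((len(X) + 1) * 3) / 4)]
    IQR = Q3 - Q1
    # Indices of the values outside [Q1 - k*IQR, Q3 + k*IQR].
    outliers_index = [i for i in range(len(X))
                      if X[i] < Q1 - k * IQR or X[i] > Q3 + k * IQR]
    return outliers_index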

def imager(X, Y, outlier_X, outlier_Y, l):
    # Shuffle the x positions so the sorted values are spread across the plot.
    X = random.sample(X, len(X))
    outlier_X = random.sample(outlier_X, l)
    plt.scatter(X, Y, color='green', marker='.')
    plt.scatter(outlier_X, outlier_Y, color='red', marker='*')
    plt.show()

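def accuracy(outliers_index, X):
    # Fraction of points labelled in agreement with the assumed ground truth:
    # values in [0.5, 0.9] count as inliers, everything else as outliers.
    flagged = set(outliers_index)
    correct = 0
    for j in range(len(X)):
        if j in flagged:
            if X[j] < 0.5 or X[j] > 0.9:
                correct += 1
        else:
            if 0.5 <= X[j] <= 0.9:
                correct += 1
    return correct / len(X)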

def main(X, k):
    outliers_index = numeric_outlier(X, k)
    flagged = set(outliers_index)
    outliers = [X[i] for i in outliers_index]
    # X_c holds the values that were not flagged.
    X_c = [X[i] for i in range(len(X)) if i not in flagged]
    imager(range(len(X_c)), X_c, range(len(X_c)), outliers, len(outliers_index))
    return accuracy(outliers_index, X)

k = 0.45
random.seed(0)
X = dataset_developer(1000)
X.sort()  # numeric_outlier expects sorted input
acc = main(X, k)
print(acc)

What I am asking is how you would structure this as a class: what its methods and attributes would be, and how someone would use it. This is a Python package, so the related functions should be defined on a class whose properties can be inherited. How would an end user access it?
I get the logic; what I am asking is how you would draft your code so that someone else can use it.

@rohansingh9001 The logic looks fine to me, but please help him draft an API for it. We can keep this in a folder like Preprocessors.

Yes, please guide me in drafting the API for this @agrawalshubham01 @rohansingh9001

@Halix267 sorry for the delay, I was busy with other important commitments.

However, in an issue, try to be as descriptive as you can. No resources are given here for us to understand what the Numerical Outlier method is or what it can do.
Even if we know this algorithm from our own background, we still do not know how robust your solution will be or what it covers.

Secondly, I would recommend you to go through the code in the Examples directory. It contains examples of how an end-user can use our library.

A user should be able to import your code and apply it to their own dataset.

For example, the Linear Regression model has a .fit() method which trains the model on the data given to it. It has various other methods as well.
While your code may be logically correct, please describe what sort of data this model needs for training, what kind of predictions it can make, and which functions the end user will call to train and run it.

That is what @agrawalshubham01 meant by discussing an API.
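
As a rough illustration of what such an API draft could look like, here is a minimal sketch assuming a NumPy-based IQR rule. The class name `NumericOutlierDetector`, its `k` parameter, and the `fit`/`predict`/`transform` methods are all hypothetical and not existing parts of this library; the quartiles come from `np.percentile`, which interpolates and so may differ slightly from a positional lookup.

```python
import numpy as np


class NumericOutlierDetector:
    """Hypothetical sketch: flags values outside [Q1 - k*IQR, Q3 + k*IQR]."""

    def __init__(self, k=1.5):
        self.k = k          # fence multiplier (assumed default of 1.5)
        self.lower_ = None  # lower fence, set by fit()
        self.upper_ = None  # upper fence, set by fit()

    def fit(self, X):
        """Learn the lower and upper fences from a 1-D collection of numbers."""
        X = np.asarray(X, dtype=float)
        q1, q3 = np.percentile(X, [25, 75])
        iqr = q3 - q1
        self.lower_ = q1 - self.k * iqr
        self.upper_ = q3 + self.k * iqr
        return self

    def predict(self, X):
        """Return a boolean mask that is True where a value is an outlier."""
        if self.lower_ is None:
            raise RuntimeError("call fit() before predict()")
        X = np.asarray(X, dtype=float)
        return (X < self.lower_) | (X > self.upper_)

    def transform(self, X):
        """Return X with the flagged outliers removed."""
        X = np.asarray(X, dtype=float)
        return X[~self.predict(X)]


# Example of how an end user might call it.
data = [10, 11, 12, 13, 14, 15, 100]
detector = NumericOutlierDetector(k=1.5).fit(data)
mask = detector.predict(data)
cleaned = detector.transform(data)
```

Keeping `fit` separate from `predict` mirrors the `.fit()` convention mentioned for Linear Regression: the fences are learned once and can then be applied to new data.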

OK @rohansingh9001, I will provide the dataset, but first can you guide me on how to proceed with this issue?

@rohansingh9001

As for your question: imagine a user is preprocessing a dataset that has a height column in feet,

with values 1.2, 1.3, 4, 5.1, 5.6, 6.3, 41.2, 50, ...

Clearly 1.2, 1.3, 41.2 and 50 are outliers, and they degrade model performance.

So, to improve model performance, the user can find the outliers and exclude them.
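
The height example can be checked with a short sketch. It assumes the common multiplier k = 1.5 (no value of k is given for this example) and uses `np.percentile` for the quartiles; the variable names are only illustrative.

```python
import numpy as np

# Height values quoted in the example above (in feet).
heights = np.array([1.2, 1.3, 4, 5.1, 5.6, 6.3, 41.2, 50])
k = 1.5  # assumed fence multiplier

# Quartiles and interquartile range.
q1, q3 = np.percentile(heights, [25, 75])
iqr = q3 - q1

# Values outside [q1 - k*iqr, q3 + k*iqr] are treated as outliers.
lower, upper = q1 - k * iqr, q3 + k * iqr
outlier_mask = (heights < lower) | (heights > upper)

outliers = heights[outlier_mask]
cleaned = heights[~outlier_mask]
print(outliers)
print(cleaned)
```

On this tiny sample the rule flags 41.2 and 50, but not 1.2 and 1.3: the two large values stretch the IQR so much that the lower fence falls below zero. Whether the small values get caught depends on the sample size and on the choice of k.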

@rohansingh9001 @agrawalshubham01 Please take a look at this. Can I start working on this issue?

I would like to work on this issue

Sure 👍

Thanks to @Siddharth-Singh27, the issue's been solved. Hereby, closing it.