How to find outlier frames in a MultiIndex DataFrame
The result should be a MultiIndex DataFrame that does not contain any outliers. The criterion is the standard deviation: np.abs(x - g_mean) <= 3 * g_std
My attempt to identify the statistical outliers:
import pandas as pd
import numpy as np
#create sample
arrays = [[1,1,1,2,2,2,3,3],
[0,1,2,0,1,2,0,1]]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['ID', 'INDEX'])
df = pd.DataFrame(np.abs(np.random.randn(8, 2)), index=index, columns=['Ts','Tf'])
#groupby index and learn from data
g = df.groupby(level='INDEX')
g_mean=g.mean()
g_std = g.std()
#groupby ID and look if some ID is an outlier
g = df.groupby(level='ID')
test = g.apply(lambda x: True if np.abs(x-g_mean) <= 3*g_std else False)
The last line of my code does not work, because inside the groups I compare DataFrames of different shapes. Any suggestions?
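Separately from the shape mismatch, the `True if ... else False` pattern in the lambda is likely to fail on its own, because pandas does not define a single truth value for a DataFrame. The following is a minimal sketch of that behaviour; the frame `mask` and its values are made up for illustration and are not taken from the question's data.

```python
import pandas as pd

# Hypothetical one-column frame of per-value checks (illustration only).
mask = pd.DataFrame({'Ts': [True, False]})

try:
    # Same pattern as in the lambda: a whole DataFrame used as a condition.
    flag = True if mask else False
except ValueError as err:
    flag = None
    message = str(err)

# Reducing explicitly gives one boolean for the whole frame.
all_ok = bool(mask.values.all())

print(message)
print(all_ok)
```

An explicit reduction such as `.values.all()` states what "the group passes" means, here that every value in the group is within the limit.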
python pandas numpy
asked 14 hours ago
incognito
206
1 Answer
accepted
You can use:
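#df.mean(level='INDEX') was removed in pandas 2.0, so group by the level instead
g_mean = df.groupby(level='INDEX').mean()
g_std = df.groupby(level='INDEX').std()
def f(x):
    #remove first level (ID) per group so that x aligns with g_mean and g_std on INDEX
    x = x.reset_index(level=0, drop=True)
    #flag values within 3 standard deviations and check that all flags are True
    m = (np.abs(x - g_mean) <= 3 * g_std).values.all()
    return m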
#groupby ID and look if some ID is an outlier
s = df.groupby(level='ID').apply(f)
print (s)
ID
1     True
2     True
3    False
dtype: bool
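One detail of this output is worth checking against your data. ID 3 has no row for INDEX 2, so the subtraction inside f aligns on INDEX and leaves a NaN row, and a NaN never satisfies <=. Also, with at most three observations per INDEX a value cannot lie more than about 1.15 sample standard deviations from the mean, so in this sample the False for ID 3 appears to come from the missing row rather than from an extreme value. A minimal sketch of the alignment effect, using made-up statistics (the numbers below are assumptions for illustration, not the question's random data):

```python
import numpy as np
import pandas as pd

# Hypothetical per-INDEX statistics for three positions (0, 1, 2).
stats_index = pd.Index([0, 1, 2], name='INDEX')
g_mean = pd.DataFrame({'Ts': [1.0, 1.0, 1.0]}, index=stats_index)
g_std = pd.DataFrame({'Ts': [0.5, 0.5, 0.5]}, index=stats_index)

# A group that only has INDEX 0 and 1, as ID 3 does in the sample.
x = pd.DataFrame({'Ts': [1.1, 0.9]}, index=pd.Index([0, 1], name='INDEX'))

deviation = np.abs(x - g_mean)    # aligned on INDEX; the row for INDEX 2 is NaN
within = deviation <= 3 * g_std   # a comparison against NaN evaluates to False

print(within)
print(within.values.all())
```

If an incomplete group should not count as an outlier, one option is to drop the NaN rows from the deviation before comparing; whether that is appropriate depends on what a missing row means in your data.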
#map the ID level through the boolean Series and filter by boolean indexing
df = df[df.index.get_level_values('ID').to_series().map(s).values]
#if necessary, remove unused levels from the MultiIndex
df.index = df.index.remove_unused_levels()
print (df)
                Ts        Tf
ID INDEX
1  0      0.612077  0.876833
   1      0.911303  0.377008
   2      0.326670  0.289647
2  0      0.525381  0.599262
   1      1.336077  1.177081
   2      1.322341  0.572035
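For the vectorization asked about in the comments, one possible direction is groupby(...).transform, which returns the per-INDEX statistics already broadcast to the shape of df, so no per-ID apply is needed. This is only a sketch, not the approach used above: the helper name drop_outlier_ids, the n_std parameter and the demo data are made up for illustration. It also behaves differently for incomplete groups: an ID is not flagged merely because it lacks some INDEX rows, although a position that occurs only once has a NaN standard deviation and is still flagged.

```python
import numpy as np
import pandas as pd

def drop_outlier_ids(df, n_std=3.0):
    """Drop every ID that has at least one value outside n_std deviations.

    Hypothetical helper; assumes index levels named 'ID' and 'INDEX'.
    """
    by_index = df.groupby(level='INDEX')
    mean = by_index.transform('mean')   # same shape and index as df
    std = by_index.transform('std')
    within = (df - mean).abs() <= n_std * std
    # An ID is kept only if all of its values in all columns pass.
    keep_id = within.all(axis=1).groupby(level='ID').all()
    good_ids = keep_id.index[keep_id]
    out = df[df.index.get_level_values('ID').isin(good_ids)].copy()
    out.index = out.index.remove_unused_levels()
    return out

# Demo data (assumption): 20 IDs with 2 positions each, one planted outlier.
ids = np.repeat(np.arange(1, 21), 2)
pos = np.tile([0, 1], 20)
index = pd.MultiIndex.from_arrays([ids, pos], names=['ID', 'INDEX'])
demo = pd.DataFrame({'Ts': 1.0 + 0.01 * ids, 'Tf': 0.1 * ids}, index=index)
demo.loc[(7, 1), 'Ts'] = 100.0

result = drop_outlier_ids(demo)
print(result.index.get_level_values('ID').unique().tolist())
```

The demo uses 20 IDs on purpose: with a sample standard deviation, a single value can exceed 3 standard deviations only when a position has at least 11 observations, so a small sample like the one in the question can never trigger the rule on values alone.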
Yes, thank you very much. That almost solves my problem. df[test] only removes the last row of the frame. How can I get the DataFrame back without the whole third block? PS: Do you know whether it is possible to easily vectorize this apply function? (I have never done this.)
– incognito
13 hours ago
@incognito - not easy, added solution.
– jezrael
12 hours ago
Thanks again, you've solved it.
– incognito
12 hours ago
edited 12 hours ago
answered 13 hours ago
jezrael
303k20233309