How to find outlier frames in a MultiIndex DataFrame

The result should be a MultiIndex DataFrame that does not contain any outliers. The criterion is the standard deviation: np.abs(x-g_mean) <= 3*g_std

My attempt to identify the statistical outliers:

import pandas as pd
import numpy as np

#create sample
arrays = [[1,1,1,2,2,2,3,3],
          [0,1,2,0,1,2,0,1]]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['ID', 'INDEX'])
df = pd.DataFrame(np.abs(np.random.randn(8, 2)), index=index, columns=['Ts','Tf'])

#groupby index and learn from data
g = df.groupby(level='INDEX')
g_mean = g.mean()
g_std = g.std()

#groupby ID and look if some ID is an outlier
g = df.groupby(level='ID')
test = g.apply(lambda x: True if np.abs(x-g_mean) <= 3*g_std else False)

The last line in my code does not work, because within each group I compare DataFrames of two different shapes. Any suggestions?

asked 14 hours ago

incognito

python pandas numpy


1 Answer

accepted

You can use:

g_mean = df.groupby(level='INDEX').mean()
g_std = df.groupby(level='INDEX').std()

def f(x):
    #remove first level per group
    x = x.reset_index(level=0, drop=True)
    #detect outliers and check that all values are True
    m = (np.abs(x-g_mean) <= 3*g_std).values.all()
    return m

#group by ID and check whether any ID is an outlier
s = df.groupby(level='ID').apply(f)
print(s)
ID
1     True
2     True
3    False
dtype: bool

#map the ID level to the boolean Series and filter by boolean indexing
df = df[df.index.get_level_values('ID').to_series().map(s).values]
#if necessary, remove unused levels from the MultiIndex
df.index = df.index.remove_unused_levels()
print(df)
                Ts        Tf
ID INDEX                    
1  0      0.612077  0.876833
   1      0.911303  0.377008
   2      0.326670  0.289647
2  0      0.525381  0.599262
   1      1.336077  1.177081
   2      1.322341  0.572035
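
On the vectorization question raised in the comments: one possible sketch, which is not part of the accepted answer, is to use groupby(...).transform so that the per-INDEX mean and standard deviation come back in the same shape as the frame, which avoids apply altogether. The sample frame `demo`, the helper name `drop_outlier_ids`, and the injected outlier are made up for illustration. Two caveats apply. First, with only three IDs per INDEX, no value can lie more than (n-1)/sqrt(n), about 1.15, sample standard deviations from its group mean, so the 3-sigma test cannot fire on the 8-row sample. ID 3 comes out False in the apply version because it has no row for INDEX 2, and the aligned comparison yields NaN there. Second, the sketch below does not treat a missing row as an outlier, so it behaves differently in that case.

```python
import numpy as np
import pandas as pd

# Hypothetical deterministic sample: 20 IDs x 3 INDEX values, evenly spread.
ids = np.arange(1, 21)
index = pd.MultiIndex.from_product([ids, [0, 1, 2]], names=['ID', 'INDEX'])
spread = 0.01 * index.get_level_values('ID').to_numpy()
demo = pd.DataFrame({'Ts': 1.0 + spread, 'Tf': 2.0 + spread}, index=index)
demo.loc[(7, 1), 'Ts'] = 100.0  # injected outlier (assumption for the demo)


def drop_outlier_ids(frame, n_std=3):
    """Drop every ID that has at least one value outside n_std standard deviations."""
    by_index = frame.groupby(level='INDEX')
    # transform returns frames with the same index as `frame`
    g_mean = by_index.transform('mean')
    g_std = by_index.transform('std')
    # element-wise criterion from the question
    within = (frame - g_mean).abs() <= n_std * g_std
    # a row is fine if all its columns are fine; an ID is fine if all its rows are
    row_ok = within.all(axis=1)
    id_ok = row_ok.groupby(level='ID').all()
    good_ids = id_ok[id_ok].index
    kept = frame[frame.index.get_level_values('ID').isin(good_ids)]
    kept.index = kept.index.remove_unused_levels()
    return kept


filtered = drop_outlier_ids(demo)
print(filtered.index.get_level_values('ID').unique().tolist())
```

If an ID with a missing INDEX row should also be dropped, as happens in the apply version, an additional check on the number of rows per ID would be needed.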

edited 12 hours ago

answered 13 hours ago

jezrael


Yeah, thank you very much. That almost solves my problem. df[test] will only remove the last line of the frame. How can I get back the DataFrame without the whole third block? PS: Do you know whether it's possible to easily vectorize this apply function (I've never done this)?
– incognito
13 hours ago

@incognito - not easy, added solution.
– jezrael
12 hours ago

Thanks again, you've solved it.
– incognito
12 hours ago
