How to find outlier frames in a MultiIndex DataFrame























The result should be a MultiIndex DataFrame that does not contain any outliers. The criterion is the standard deviation: np.abs(x-g_mean) <= 3*g_std



My attempt to identify the statistical outliers:



import pandas as pd
import numpy as np

#create sample
arrays = [[1,1,1,2,2,2,3,3],
[0,1,2,0,1,2,0,1]]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['ID', 'INDEX'])
df = pd.DataFrame(np.abs(np.random.randn(8, 2)), index=index, columns=['Ts','Tf'])

#groupby index and learn from data
g = df.groupby(level='INDEX')
g_mean=g.mean()
g_std = g.std()

#groupby ID and look if some ID is an outlier
g = df.groupby(level='ID')
test = g.apply(lambda x: True if np.abs(x-g_mean) <= 3*g_std else False)


The last line in my code does not work, because in the last group I compare two DataFrames of different shapes. Any suggestions?










      python pandas numpy
















      asked 14 hours ago by incognito

      206




























          1 Answer (accepted)










          You can use:



          # per-INDEX statistics (the level= argument of mean/std was removed in pandas 2.0)
          g_mean = df.groupby(level='INDEX').mean()
          g_std = df.groupby(level='INDEX').std()

          def f(x):
              # remove the first level (ID) so the group aligns with g_mean/g_std on INDEX
              x = x.reset_index(level=0, drop=True)
              # flag outliers and check that all values are True
              m = (np.abs(x - g_mean) <= 3 * g_std).values.all()
              return m

          # group by ID and check whether any ID is an outlier
          s = df.groupby(level='ID').apply(f)
          print(s)
          ID
          1     True
          2     True
          3    False
          dtype: bool

          # map the ID level to the boolean Series and filter by boolean indexing
          df = df[df.index.get_level_values('ID').to_series().map(s).values]
          # if necessary, remove unused levels from the MultiIndex
          df.index = df.index.remove_unused_levels()
          print(df)
                          Ts        Tf
          ID INDEX
          1  0      0.612077  0.876833
             1      0.911303  0.377008
             2      0.326670  0.289647
          2  0      0.525381  0.599262
             1      1.336077  1.177081
             2      1.322341  0.572035
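For context, here is a minimal sketch of one reason the lambda in the question raises an error. The comparison returns a boolean DataFrame, and `True if ... else False` needs a single truth value, which pandas refuses to guess. The small `mask` frame is made up for illustration; reducing it with `.values.all()`, as the function above does, gives a plain scalar.

```python
import pandas as pd

# Hypothetical boolean result of a comparison such as np.abs(x - g_mean) <= 3 * g_std.
mask = pd.DataFrame({"Ts": [True, True], "Tf": [True, False]})

try:
    # What the question's lambda effectively does with the comparison result.
    flag = True if mask else False
except ValueError as exc:
    flag = None
    message = str(exc)

# Reduce the frame to one scalar: True only if every cell is True.
all_ok = bool(mask.values.all())
```

Here `flag` stays `None` because the conditional raises, while `all_ok` is a regular Python bool.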




























          • Yeah, thank you very much. That almost solves my problem. df[test] will only remove the last line of the frame. How can I get back the DataFrame without the whole third block? PS: Do you know whether it is possible to easily vectorize this apply function? (I have never done this.)
            – incognito
            13 hours ago










          • @incognito - not easy, added solution.
            – jezrael
            12 hours ago






          • Thanks again, you've solved it.
            – incognito
            12 hours ago
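On the vectorization question raised in the comments, one possible sketch (not the approach from the answer) is to broadcast the per-INDEX statistics with `groupby(...).transform` and reduce per ID without `apply`. The sample data, the injected value of 100.0 and the names `within`, `row_ok`, `id_ok` and `filtered` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 12 IDs x 3 INDEX positions, values made up for illustration.
ids = np.repeat(np.arange(1, 13), 3)
pos = np.tile(np.arange(3), 12)
index = pd.MultiIndex.from_arrays([ids, pos], names=["ID", "INDEX"])
df = pd.DataFrame({"Ts": 1.0 + 0.01 * ids, "Tf": 2.0 + 0.01 * ids}, index=index)
df.loc[(5, 1), "Ts"] = 100.0  # inject one extreme value into ID 5

# Broadcast the per-INDEX mean and standard deviation back to the shape of df.
by_index = df.groupby(level="INDEX")
g_mean = by_index.transform("mean")
g_std = by_index.transform("std")

# True where a value lies within 3 standard deviations of its INDEX group mean.
within = (df - g_mean).abs() <= 3 * g_std

# Keep an ID only if every value in every one of its rows passes the test.
row_ok = within.all(axis=1)
id_ok = row_ok.groupby(level="ID").transform("all")
filtered = df[id_ok]
kept_ids = sorted(filtered.index.get_level_values("ID").unique())
```

In this sketch all three rows of ID 5 are dropped and the other eleven IDs are kept. Two caveats: it only judges rows that exist, so an ID with a missing INDEX value is not rejected for that reason alone, which appears to differ from the apply-based function, where the aligned difference contains NaN. Also, the sketch uses 12 IDs on purpose: with only 3 values per INDEX group, as in the question's sample, no value can lie more than about 1.15 sample standard deviations from the group mean, so the 3-sigma test cannot fire.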













          answered 13 hours ago by jezrael (edited 12 hours ago)

          303k20233309



