Select a specific range of elements from a Python dictionary based on a condition

Asked by: spectre  Asked: 8/2/2023  Last edited by: spectre  Updated: 8/10/2023  Views: 118

Q:

I have the following dictionary:

ip_dict = {
    "doc_1" : {
                "img_1" : ("FP","some long text"),
                "img_2" : ("LP", "another long text"),
                "img_3" : ("Others", "long text"),
                "img_4" : ("Others", "some loong text"),
                "img_5" : ("FP", "one more text"),
                "img_6" : ("FP", "another one"),
                "img_7" : ("LP", "ANOTHER ONE"),
                "img_8" : ("Others", "some text"),
                "img_9" : ("Others", "some moretext"),
                "img_10" : ("FP", "more text"),
                "img_11" : ("Others", "whatever"),
                "img_12" : ("Others", "more whatever"),
                "img_13" : ("LP", "SoMe TeXt"),
                "img_14" : ("Others", "some moretext"),
                "img_15" : ("FP", "whatever"),
                "img_16" : ("Others", "whatever"),
                "img_17" : ("LP", "whateverrr")
            },

    "doc_2" : {
                "img_1" : ("FP", "text"),
                "img_2" : ("FP", "more text"),
                "img_3" : ("LP", "more more text"),
                "img_4" : ("Others", "some more"),
                "img_5" : ("Others", "text text"),
                "img_6" : ("FP", "more more text"),
                "img_7" : ("Others", "lot of text"),
                "img_8" : ("LP", "still more text")
            }

}

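Here FP and LP stand for first page and last page. For all docs I only want to extract the FP and LP entries. Others entries should be extracted only if they lie between an FP and an LP, since they represent the pages between FP and LP. If they lie outside an FP and LP pair, ignore them. Also, an FP that is not followed by an LP should be treated as a single page and extracted. So my output dictionary looks like this:

op_dict = {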
    "doc_1" : [
                {
                "img_1" : ("FP","some long text"),
                "img_2" : ("LP", "another long text")
                },

                {
                    "img_5" : ("FP", "one more text")
                },

                {
                    "img_6" : ("FP", "another one"),
                    "img_7" : ("LP", "ANOTHER ONE")
                },

                {
                    "img_10" : ("FP", "more text"),
                    "img_11" : ("Others", "whatever"),
                    "img_12" : ("Others", "more whatever"),
                    "img_13" : ("LP", "SoMe TeXt"),
                },

                {
                    "img_15" : ("FP", "whatever"),
                    "img_16" : ("Others", "whatever"),
                    "img_17" : ("LP", "whateverrr"),
                }
            ],


    "doc_2" : [

                {
                "img_1" : ("FP", "text")
                },

                {        
                "img_2" : ("FP", "more text"),
                "img_3" : ("LP", "more more text")
                },        

                {
                "img_6" : ("FP", "more more text"),
                "img_7" : ("Others", "lot of text"),
                "img_8" : ("LP", "still more text")
                },

            ]
}

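As you can see, all FP and LP entries have been extracted, and the Others entries that lie between an FP and an LP have also been extracted and stored in the same dictionary. In addition, each FP that is not followed by an LP has been extracted on its own.

P.S.:

ip_dict = {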
    "doc_1" : {
                "img_1" : ("LP","some long text"),
                "img_2" : ("Others", "another long text"),
                "img_3" : ("Others", "long text"),
                "img_4" : ("FP", "long text"),
                "img_5" : ("Others", "long text"),
                "img_6" : ("LP", "long text")
            }
}

op_dict =     {
        "doc_1" : [{
                    "img_1" : ("LP","some long text")
                },
                    {
                    "img_4" : ("FP", "long text"),
                    "img_5" : ("Others", "long text"),
                    "img_6" : ("LP", "long text")
                    }
                  ]
    
              }

Any help is appreciated!

python-3.x dictionary data

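Comments

0 votes matszwecja 8/2/2023
Dictionaries have no order (well, at least conceptually). There is no "between" for the elements in them.
0 votes 8/2/2023
@matszwecja Yes, but what the author is saying is that Others should be considered only if they appear between FP and LP. If not, ignore them.
0 votes matszwecja 8/2/2023
@shreyjain But they cannot appear "between" anything in a structure that has no order.
0 votes John Collins 8/2/2023
As of Python 3.6+, dictionaries do preserve their insertion order (see: stackoverflow.com/questions/39980323/...).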
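
A minimal sketch of the point made in the last comment. Insertion order is an implementation detail of CPython 3.6 and a language guarantee from Python 3.7 onward. The keys and values below are made up for illustration only:

```python
# Keys are inserted in a deliberately non-sorted order.
d = {}
d["img_2"] = "LP"
d["img_10"] = "FP"
d["img_1"] = "Others"

# Iteration follows insertion order, not alphabetical or numeric order.
keys = list(d)
print(keys)
```

So "between" is meaningful only relative to the order in which the entries were inserted.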

A:

0 votes Krittipoom 8/2/2023 #1

Here is my solution; it is a bit long:

for doc in ip_dict:
    print('\n', doc, '\n')

    ignore = True

    for img in ip_dict[doc]:
    
        TYPE = ip_dict[doc][img][0] # FP or LP
        TEXT = ip_dict[doc][img][1] # The text
    
        if TYPE == 'FP':
            ignore = False
    
        if ignore == False:
            print(img,' :\t', TYPE, '/', TEXT)
        
        if TYPE == 'LP':
            ignore = True

Result:

doc_1 

img_1  :     FP / some long text
img_2  :     LP / another long text
img_5  :     FP / one more text
img_6  :     FP / another one
img_7  :     LP / ANOTHER ONE
img_10  :    FP / more text
img_11  :    Others / whatever
img_12  :    Others / more whatever
img_13  :    LP / SoMe TeXt
img_15  :    FP / whatever
img_16  :    Others / whatever
img_17  :    LP / whateverrr

doc_2 

img_1  :     FP / text
img_2  :     FP / more text
img_3  :     LP / more more text
img_6  :     FP / more more text
img_7  :     Others / lot of text
img_8  :     LP / still more text
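
The same flag idea can collect the groups into a dictionary of lists instead of printing them. The following is only a sketch under stated assumptions: `group_pages` is a hypothetical helper name, it relies on the dictionaries being in insertion order, and it treats an LP with no open FP as a single page, which is how the second example in the question reads.

```python
def group_pages(docs):
    """Group each document's images into FP..LP blocks (hypothetical helper)."""
    result = {}
    for doc, imgs in docs.items():
        groups = []
        current = None  # block opened by an FP and not yet closed, or None
        for key, value in imgs.items():  # relies on insertion order
            tag = value[0]
            if tag == "FP":
                if current is not None:
                    # The previous FP never reached an LP: keep only that FP.
                    first = next(iter(current))
                    groups.append({first: current[first]})
                current = {key: value}
            elif tag == "LP":
                if current is None:
                    # Assumption: an LP with no open FP is a single page.
                    groups.append({key: value})
                else:
                    current[key] = value
                    groups.append(current)
                    current = None
            elif current is not None:
                current[key] = value  # "Others" inside an open block
        if current is not None:
            # Trailing FP without an LP: keep only the FP itself.
            first = next(iter(current))
            groups.append({first: current[first]})
        result[doc] = groups
    return result


sample = {
    "doc_1": {
        "img_1": ("LP", "some long text"),
        "img_2": ("Others", "another long text"),
        "img_3": ("Others", "long text"),
        "img_4": ("FP", "long text"),
        "img_5": ("Others", "long text"),
        "img_6": ("LP", "long text"),
    }
}
print(group_pages(sample))
```

Keeping the open block in a single variable (`current`) avoids separate FP and LP counters: the block is either open or it is not.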

Comments

0 votes 8/2/2023
This does not give the output as a dictionary, which is what the user asked for!

0 votes Debi Prasad 8/2/2023 #2

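Try this approach. It is a classic use of the flag method, but as the comments say, it only works if the dictionary entries were inserted in order. As it stands, it gives the desired output.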


def process(ip_dict):
    op_dict=dict()
    for key,value in ip_dict.items():
        op_list=[]
        fp_counter=0
        lp_counter=0
        op_dup=dict()
        for key1,value1 in value.items():
            if value1[0] == "FP" and fp_counter==1:
                fp_counter=1
                if len(op_dup) != 0:
                    op_list.append(op_dup)
                op_dup=dict()
                op_dup[key1]=value1
                continue
            
            if value1[0] == "FP" and fp_counter==0:
                fp_counter=1
                
               
            if value1[0] == "LP" and lp_counter==1:
                lp_counter=1
                if len(op_dup) != 0:
                    op_list.append(op_dup)
                op_dup=dict()
                op_dup[key1]=value1
                continue
            
            if value1[0] == "LP" and lp_counter==0:
                lp_counter=1
                
            if(lp_counter==0 and fp_counter == 1):
                op_dup[key1]=value1
                
            if(lp_counter == 1 and fp_counter == 1 and value1[0] == "LP"):
                op_dup[key1]=value1
                
            if(lp_counter == 1 and fp_counter == 1 and value1[0] != "LP"):
                if len(op_dup) != 0:
                    op_list.append(op_dup)
                op_dup=dict()
                lp_counter=0
                fp_counter=0
        if(len(op_dup) != 0):
            op_list.append(op_dup)
        op_dict[key]=op_list
    return op_dict

print(process(ip_dict))     
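
If the insertion order cannot be trusted, one option is to sort each inner dictionary before processing it. This is a sketch only: `sort_images` is a hypothetical helper, and it assumes every key has the form `img_<number>` as in the question.

```python
def sort_images(imgs):
    """Return a new dict ordered by the number after the last underscore."""
    return dict(
        sorted(imgs.items(), key=lambda kv: int(kv[0].rsplit("_", 1)[1]))
    )


# Made-up input whose keys were inserted out of order.
shuffled = {
    "img_10": ("LP", "c"),
    "img_2": ("Others", "b"),
    "img_1": ("FP", "a"),
}
ordered = sort_images(shuffled)
print(list(ordered))
```

The key is converted to an integer so that `img_10` sorts after `img_2`; a plain string sort would place it before.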

1 vote John Collins 8/2/2023 #3

One possible approach:

op_dict = {}
for doc, imgs in ip_dict.items():
    op_dict[doc] = []
    first_page = False  # reset for every document
    new = {}
    for k, v in imgs.items():
        if v[0] == "FP":
            if first_page:
                # The previous FP never reached an LP: keep only that FP.
                first_key = next(iter(new))
                op_dict[doc].append({first_key: new[first_key]})
            new = {k: v}
            first_page = True
            continue
        if first_page:
            new[k] = v
            if v[0] == "LP":
                op_dict[doc].append(new)
                first_page = False
    if first_page:
        # Trailing FP without an LP: keep only the FP itself.
        first_key = next(iter(new))
        op_dict[doc].append({first_key: new[first_key]})

This gives:

{'doc_1': [{'img_1': ('FP', 'some long text'),
   'img_2': ('LP', 'another long text')},
  {'img_5': ('FP', 'one more text')},
  {'img_6': ('FP', 'another one'), 'img_7': ('LP', 'ANOTHER ONE')},
  {'img_61': ('FP', 'another one'), 'img_71': ('LP', 'ANOTHER ONE')},
  {'img_62': ('FP', 'another one'), 'img_72': ('LP', 'ANOTHER ONE')},
  {'img_54': ('FP', 'one more text')},
  {'img_540': ('FP', 'one more text')},
  {'img_541': ('FP', 'one more text')},
  {'img_13': ('FP', 'more text'),
   'img_14': ('Others', 'whatever'),
   'img_140': ('Others', 'whatever'),
   'img_141': ('Others', 'whatever'),
   'img_142': ('Others', 'whatever'),
   'img_15': ('Others', 'more whatever'),
   'img_16': ('LP', 'SoMe TeXt')},
  {'img_18': ('FP', 'whatever'),
   'img_19': ('Others', 'whatever'),
   'img_20': ('LP', 'whateverrr')}],
 'doc_2': [{'img_1': ('FP', 'text')},
  {'img_2': ('FP', 'more text'), 'img_3': ('LP', 'more more text')},
  {'img_6': ('FP', 'more more text'),
   'img_7': ('Others', 'lot of text'),
   'img_8': ('LP', 'still more text')},
  {'img_69': ('FP', 'more more text')}]}

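Comments

1 vote spectre 8/2/2023
Your solution does not work when the last element is FP. Also, when an FP is followed by Others and then another FP, your solution groups these 3 together, which is not the expected result. I have added examples to the question.
0 votes spectre 8/2/2023
I have also added the examples, along with the expected output, to the question.
1 vote John Collins 8/2/2023
@spectre Ah, OK. I have updated the answer.
1 vote spectre 8/2/2023
Works like a charm!

3 votes RomanPerekhrest 8/2/2023 #4

With extended sequential logic: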

def select_page_ranges(d: dict):

    def _del_excess_items():
        # if previous block was not closed and has excess entries
        if start and last_mark != 'FP':
            res[pk][-1] = {start_key: res[pk][-1][start_key]}

    res = {}
    for pk, imgs in d.items():
        res[pk] = []
        start, start_key, last_mark = None, None, ''
        for k, v in imgs.items():
            if v[0] == 'FP':
                _del_excess_items()
                res[pk].append({k: v})
                start = True
                start_key = k
            elif v[0] == 'LP':
                if start:
                    res[pk][-1].update({k: v})
                else:
                    # LP with no open FP: keep it as a single page
                    res[pk].append({k: v})
                start = False
            elif start:
                res[pk][-1].update({k: v})
            last_mark = v[0]
        _del_excess_items()
    return res

print(select_page_ranges(ip_dict))

{'doc_1': [{'img_1': ('FP', 'some long text'),
            'img_2': ('LP', 'another long text')},
           {'img_5': ('FP', 'one more text')},
           {'img_6': ('FP', 'another one'), 'img_7': ('LP', 'ANOTHER ONE')},
           {'img_61': ('FP', 'another one'), 'img_71': ('LP', 'ANOTHER ONE')},
           {'img_62': ('FP', 'another one'), 'img_72': ('LP', 'ANOTHER ONE')},
           {'img_54': ('FP', 'one more text')},
           {'img_540': ('FP', 'one more text')},
           {'img_541': ('FP', 'one more text')},
           {'img_13': ('FP', 'more text'),
            'img_14': ('Others', 'whatever'),
            'img_140': ('Others', 'whatever'),
            'img_141': ('Others', 'whatever'),
            'img_142': ('Others', 'whatever'),
            'img_15': ('Others', 'more whatever'),
            'img_16': ('LP', 'SoMe TeXt')},
           {'img_18': ('FP', 'whatever'),
            'img_19': ('Others', 'whatever'),
            'img_20': ('LP', 'whateverrr')}],
 'doc_2': [{'img_1': ('FP', 'text')},
           {'img_2': ('FP', 'more text'), 'img_3': ('LP', 'more more text')},
           {'img_6': ('FP', 'more more text'),
            'img_7': ('Others', 'lot of text'),
            'img_8': ('LP', 'still more text')},
           {'img_69': ('FP', 'more more text')}]}

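Comments

1 vote spectre 8/2/2023
@RomanPerekhrest Your solution does not work when we have consecutive FP. I have added sample examples, along with the expected output, to my question.
1 vote RomanPerekhrest 8/2/2023
@spectre "When we have consecutive FP": you should have said "when we have more than 2 consecutive FPs".
1 vote spectre 8/2/2023
@RomanPerekhrest Sorry. It must have escaped my attention!
1 vote RomanPerekhrest 8/2/2023
@spectre, please see my update
1 vote RomanPerekhrest 8/3/2023
@spectre, please check my update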