Asked by: Pablo Pinillos Asked: 11/12/2023 Last edited by: Pablo Pinillos Updated: 11/13/2023 Views: 63
Sorting arrays by frequency inside documents in MongoDB collection
Q:
I am trying to sort several arrays by frequency within each document of a MongoDB collection.
My documents currently look like this:
{
"_id": whatever,
"color": "blue",
"fruit": ["apple", "banana", "apple", "orange", "apple", "orange", ...],
"vegetable": ["onion", "lettuce", "spinach", "lettuce", ...],
"meat": ["pulled pork", "steak", "chicken wings", "pulled pork", "pulled pork", ...]
}
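Note: this is not the real data, but the documents have exactly the same properties.
The end goal is to find the most common fruit, vegetable, and meat for each color, so I guess that if I could group by color and take the first element of each array once it is sorted by frequency, that would cover my needs.
I have tried unwinding, but my database is too big to unwind every array (each array has around 50,000 elements, so 50,000^3 does not seem ideal). I also looked for a "mode" group function, since MongoDB has a "median" function, but there does not seem to be one (v5.0.22). I also looked at the map and reduce functions for pipelines (db.collection.aggregate({$map //or $reduce...})), but honestly I did not get very far, as this is somewhat new to me, although it seems to me that this might be the way to go.
Has anyone done something similar that might work here? Thanks!!
A:
The OP has indicated that the documents we see in the question are the result of a $group (ostensibly on color) that presumably used $push to append fruit, vegetable, and meat values onto ever-growing arrays (up to 50,000 elements). If the goal is to get the mode for each color, $facet can be used as a "multi-group" that works on the individual documents before they are grouped. Assume each individual document has a shape like this (note: only fruit and meat are used here for simplicity; the approach extends to any other field in the document, such as vegetable):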
{"color": "blue", "fruit": "A", "meat": "X"}
The following $facet pipeline will produce the modes we seek:
db.foo.aggregate([
{$facet: {
"most_fruit": [
// Sum by color and fruit name:
{$group: {_id: {c:'$color', v:'$fruit'}, N: {$sum: 1}}}
// Reorg by color only....
,{$group: {_id: '$_id.c', X:{$push: {v:'$_id.v',N:'$N'}}}}
// ...and now sort highest-to-lowest and take the highest one.
// Nice thing is if you really want, you are already set up to
// capture, for example, the highest *and* the lowest.
// Note: $sortArray requires MongoDB 5.2 or later.
,{$project: {
X: {$first: {$sortArray: {input: '$X', sortBy: {'N':-1} }} }
}}
],
// Same thing ... but for meat
"most_meat": [
{$group: {_id: {c:'$color', v:'$meat'}, N: {$sum: 1}}}
,{$group: {_id: '$_id.c', X:{$push: {v:'$_id.v',N:'$N'}}}}
,{$project: {
X: {$first: {$sortArray: {input: '$X', sortBy: {'N':-1} }} }
}}
]
}}
]);
This produces something with this shape:
{
most_fruit: [
{_id: 'blue', X: {v: 'A', N: 2} },
{_id: 'green',X: {v: 'F', N: 3} }
],
most_meat: [
{_id: 'green',X: {v: 'Z', N: 4} },
{_id: 'blue', X: {v: 'X', N: 3} }
]
}
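To make the logic of a single facet easier to follow, here is a rough plain-JavaScript emulation of what the two $group stages and the final sort-and-take-first step compute. It runs without MongoDB. The helper name modeByColor and the sample documents are made up for illustration and are not part of the original answer or of any MongoDB API.

```javascript
// Plain-JavaScript emulation of one facet; no MongoDB required.
// modeByColor is a hypothetical helper name, not a MongoDB API.
function modeByColor(docs, field) {
  // First $group: count occurrences per (color, value) pair.
  const counts = new Map(); // color -> Map(value -> count)
  for (const doc of docs) {
    if (!counts.has(doc.color)) counts.set(doc.color, new Map());
    const perColor = counts.get(doc.color);
    perColor.set(doc[field], (perColor.get(doc[field]) || 0) + 1);
  }
  // Second $group plus sort-and-first: per color, keep the value
  // with the highest count.
  const result = [];
  for (const [color, perColor] of counts) {
    let best = null;
    for (const [v, N] of perColor) {
      if (best === null || N > best.N) best = { v, N };
    }
    result.push({ _id: color, X: best });
  }
  return result;
}

// Made-up sample documents in the flat shape assumed by the answer.
const sampleDocs = [
  { color: "blue", fruit: "A", meat: "X" },
  { color: "blue", fruit: "A", meat: "Y" },
  { color: "blue", fruit: "B", meat: "X" },
  { color: "green", fruit: "F", meat: "Z" },
];

console.log(JSON.stringify(modeByColor(sampleDocs, "fruit")));
```

In this sketch a tie is resolved in favor of the value seen first; the aggregation pipeline gives no such guarantee, so tied values may come back in any order there.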
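A single document is returned containing all the information. Although it is not organized by color, no further server-side processing (in MQL) would make the data any more efficiently grouped or filtered; at this point it is up to the client to arrange the information.
Here is a candidate client-side reorganization: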
// c is assumed to be the cursor returned by the db.foo.aggregate(...) call above.
var oneDoc = c.next();
var color_major = {};
function processItem(obj, fname) {
    for(var n = 0; n < obj[fname].length; n++) {
        var cn = obj[fname][n]['_id'];
        if(undefined === color_major[cn]) {
            // Set up ALL the possible modes. -1 is our way of
            // signalling it has not yet been set.
            color_major[cn] = {'most_fruit':-1,'most_meat':-1};
        }
        color_major[cn][fname] = obj[fname][n]['X'];
    }
}
processItem(oneDoc, 'most_fruit');
processItem(oneDoc, 'most_meat');
print(color_major);
Yields:
{
green: { most_fruit: { v: 'F', N: 3 }, most_meat: { v: 'Z', N: 4 } },
blue: { most_fruit: { v: 'A', N: 2 }, most_meat: { v: 'X', N: 3 } }
}
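The reorganization above hard-codes the two facet names. As a possible variation, the sketch below walks whatever facet names are present in the result, so adding a vegetable facet would need no client-side change. The helper name toColorMajor is hypothetical, and the input is simply the $facet output shown earlier, pasted in as a literal so the snippet runs without a database.

```javascript
// toColorMajor is a hypothetical helper name, not part of the original answer.
function toColorMajor(facetDoc) {
  const colorMajor = {};
  for (const [facetName, rows] of Object.entries(facetDoc)) {
    for (const row of rows) {
      // Create the per-color entry the first time a color is seen.
      if (!Object.prototype.hasOwnProperty.call(colorMajor, row._id)) {
        colorMajor[row._id] = {};
      }
      colorMajor[row._id][facetName] = row.X;
    }
  }
  return colorMajor;
}

// The $facet output shown earlier, used here as sample input.
const facetDoc = {
  most_fruit: [
    { _id: "blue", X: { v: "A", N: 2 } },
    { _id: "green", X: { v: "F", N: 3 } },
  ],
  most_meat: [
    { _id: "green", X: { v: "Z", N: 4 } },
    { _id: "blue", X: { v: "X", N: 3 } },
  ],
};

console.log(toColorMajor(facetDoc));
```

Unlike the version above, this sketch does not pre-fill missing modes with -1; a color that is absent from one facet simply lacks that key.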
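If you really want to beat up on the DB...
(...though it is hardly a beating, because the data coming out of the $facet stage above is very small...)
here are additional pipeline stages, appended after the $facet stage, that reformat the data: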
// Turn $facet field names (lval) into values (rval):
,{$project: {X: {$objectToArray: '$$ROOT'} }}
// Double unwind is OK because there is only #color X 2 (fruit and meat) entries.
// Even if 1000 colors and fruit meat and veg and whatevs, still quite doable:
,{$unwind: '$X'}
,{$unwind: '$X.v'}
// Reorg on color:
,{$group: {_id: '$X.v._id', W: {$push: {k: '$X.k', v: '$X.v.X'}} }}
// ...and put it all back together:
,{$replaceRoot: { newRoot: {$mergeObjects: [ {color:'$_id'}, {$arrayToObject: '$W'} ]}
}}
Yields:
{
color: 'green',
most_fruit: {
v: 'F',
N: 3
},
most_meat: {
v: 'Z',
N: 4
}
}
{
color: 'blue',
most_fruit: {
v: 'A',
N: 2
},
most_meat: {
v: 'X',
N: 3
}
}
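Because the facets differ only in the field name, the $facet stage could also be generated rather than written out by hand, which is one way to extend the approach to vegetable. The following is a minimal sketch under that assumption; buildModeFacet is a hypothetical helper, and the stages it emits are copied from the pipeline above with only the field name varying.

```javascript
// buildModeFacet is a hypothetical helper; it only builds a plain object
// and does not talk to MongoDB.
function buildModeFacet(fields) {
  const facet = {};
  for (const field of fields) {
    facet["most_" + field] = [
      // Sum by color and field value:
      { $group: { _id: { c: "$color", v: "$" + field }, N: { $sum: 1 } } },
      // Reorg by color only:
      { $group: { _id: "$_id.c", X: { $push: { v: "$_id.v", N: "$N" } } } },
      // Sort highest-to-lowest and take the highest one:
      { $project: { X: { $first: { $sortArray: { input: "$X", sortBy: { N: -1 } } } } } },
    ];
  }
  return { $facet: facet };
}

const stage = buildModeFacet(["fruit", "meat", "vegetable"]);
console.log(JSON.stringify(stage, null, 2));
// In mongosh, the stage could then be passed along as: db.foo.aggregate([stage])
```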