|
我在PostgreSQL表中有一堆文本行,我正在尝试查找常见的字符串。% U5 Q4 I9 g" `- a* a$ D; Y
例如,假设我有一个基本表,例如:% r2 W& Q/ K o8 r7 i; H
CREATE TABLE a (id serial, value text);
4 ]; S& a) _/ ^( i( R% iINSERT INTO a (value) VALUES
8 }1 V4 `. V6 g0 A r ('I go to the movie theater'), & m0 n, l0 ~, a9 M. L, v2 E
('New movie theater releases'),
# @" ?/ E4 {1 H9 U ('Coming out this week at your local movie theater'),
1 F, a4 H# ~( _# i" ?% q* W ('New exposition about learning disabilities at the children museum'),2 `# \/ Q- z$ p* S
('The genius found in learning disabilities')
0 |2 O5 _7 ~- ?: C1 j) p- ?, p;
1 S( K$ U& z3 h ]+ R, {" g我正在尝试在所有行中movie theater以及learning
- i- F6 f: \. y O' R2 Zdisabilities所有行中找到流行的字符串(目标是显示“趋势”字符串之类的列表,例如Twitter“趋势”之王)
" M5 E1 |/ x" P我使用全文搜索,并且尝试ts_stat结合使用,ts_headline但是结果令人失望。$ N6 a- n7 x% M/ n) y+ W
有什么想法吗?谢谢!8 b0 y# J3 p) _2 |# T7 L" T1 Q4 R& d
7 @& l8 ~2 [1 V2 }/ O& \. ~
解决方案:
- G5 J3 j3 y. s0 {0 r4 Q* j; B + q# K$ b; U+ I/ Q" Y: L
/ g8 P6 P: _8 Q k$ M( D
& `+ Q, u3 ]. {& K) }9 y, k4 j. S 没有现成的Posgres文本搜索功能可以找到最受欢迎的短语。对于两个单词的短语,您可以ts_stat()用来查找最流行的单词,消除质点,介词等,并将这些单词交叉连接以查找最流行的单词对。; d' Z, r t" u# v
对于实际数据,您需要更改标记为--> parameter.的值。对于较大的数据集,查询可能会非常昂贵。* C! \9 I: \, ^8 A
with popular_words as (
! j' j9 X0 I/ B& C5 R. B select word
& E/ [ Y7 [. ~+ ] from ts_stat('select value::tsvector from a')
& ? M5 I( `( |6 T where nentry > 1 --> parameter: P5 _- U g$ ^
and not word in ('to', 'the', 'at', 'in', 'a') --> parameter4 I9 W- P; t; l2 z: i% a
)
$ f3 B% M0 M$ a7 ~, C- mselect concat_ws(' ', a1.word, a2.word) phrase, count(*) , a$ _) a3 v* C, L3 K
from popular_words as a1& M* u* E5 R7 A/ s0 z/ K, H5 v4 |) `" t/ f
cross join popular_words as a2) {8 A: ~* N. y
cross join a
7 \" C! Z/ o' @ r2 i6 _" m- N5 Vwhere value ilike format('%%%s %s%%', a1.word, a2.word)+ K+ l: g. V: ^ I! I
group by 1
$ \ ]# W. o' R0 T- t5 qhaving count(*) > 1 --> parameter) t9 J: n: A- Q9 W, c* j# D
order by 2 desc;
: z: U3 g6 D/ ~! B2 L' M: l- `: L6 u- t( _
phrase | count % k2 }# {& y7 p- P
-----------------------+-------
$ L4 H/ z# o; _& C movie theater | 3
) [1 h: d9 a3 j, O( `% i& `7 [ learning disabilities | 2( F7 b; R, j" O
(2 rows) |
|