|
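I have some data in an SQLite database with 224,000 rows. I want to extract time-series information from it to feed a data visualisation tool. Essentially, each row in the database is an event that has a time in seconds (strictly since the epoch), a date-time group and a name (among other strictly related things). I want to extract how many events there are per week in the database.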
That's simple enough:

    SELECT COUNT(*), name, strftime('%W:%Y', time, 'unixepoch')
    FROM events
    GROUP BY strftime('%W:%Y', time, 'unixepoch'), name
    ORDER BY time

and we get about 6,000 rows of data:
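
    count  name   week:year
    23     fudge  23:2009
    etc...

But I don't want a row for each name in each week. I want one row per name and one column per week, like this:

    Name   23:2009  24:2009  25:2009
    fudge  23       6        19
    fish   1        0        12
    etc...

The monitoring process has now been running for 69 weeks, and there are 502 unique names. So clearly I'm not keen on any solution that involves hard-coding all the columns. I'm less familiar with anything that involves iterating over the lot, say with Python's executemany(), but I'm willing to accept that if necessary. SQL is supposed to work set-wise, damn it.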
Solution:
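A good approach in cases like this is not to push SQL to the point where it becomes convoluted and hard to understand and maintain. Let SQL do what it conveniently can, then post-process the query results in Python.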
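As a minimal sketch of that division of labour, assuming the `events` table from the question with `name` and `time` (epoch seconds) columns: SQLite does the grouping and counting, and a few lines of Python pivot the result into one row per name. The in-memory database, the sample rows and the `weekly_counts` helper are illustrative assumptions, not part of the original data.

```python
import sqlite3
from collections import defaultdict


def weekly_counts(conn):
    """Return (sorted column keys, {name: {(year, week): count}})."""
    sql = """
        SELECT name,
               CAST(strftime('%Y', time, 'unixepoch') AS INTEGER) AS year,
               CAST(strftime('%W', time, 'unixepoch') AS INTEGER) AS week,
               COUNT(*) AS n
        FROM events
        GROUP BY name, year, week
    """
    cells = defaultdict(dict)
    columns = set()
    for name, year, week, n in conn.execute(sql):
        cells[name][(year, week)] = n   # one cell per (name, week)
        columns.add((year, week))       # every week seen becomes a column
    return sorted(columns), dict(cells)


# Hypothetical sample data: a tiny in-memory stand-in for the real table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (name TEXT, time INTEGER)")
DAY = 86400
base = 1244030400  # 2009-06-03 12:00:00 UTC
rows = [
    ("fudge", base),
    ("fudge", base + 3600),
    ("fudge", base + 7 * DAY),
    ("fish", base + 7 * DAY),
]
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

columns, cells = weekly_counts(conn)
# Pivot: one list of counts per name, 0 where a name has no events that week.
table = {name: [cells[name].get(c, 0) for c in columns] for name in sorted(cells)}

print(columns)
for name, counts in table.items():
    print(name, counts)
```

Nothing here hard-codes the 69 week columns or the 502 names; both are discovered from the query result.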
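Here is a cut-down version of a simple crosstab generator that I wrote. The full version also provides row, column and grand totals.
You'll notice that it has a built-in "group by": the original use case was summarising data obtained from Excel files using Python and xlrd.
The row_key and col_key values that you supply don't have to be strings as in the example. They can be tuples (e.g. (year, week) in your case), or they can be integers (e.g. if you have a mapping from string column names to integer sort keys).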
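To illustrate why non-string keys can help: "week:year" strings such as the question's `23:2009` sort lexicographically, which goes wrong across a year boundary, whereas (year, week) tuples sort chronologically. A minimal sketch, using made-up labels and a hypothetical `to_key` helper:

```python
def to_key(label):
    """Turn a 'WW:YYYY' label into a (year, week) tuple of ints."""
    week, year = label.split(":")
    return int(year), int(week)


labels = ["02:2010", "23:2009", "51:2009"]  # made-up week labels

as_strings = sorted(labels)             # plain string comparison
as_tuples = sorted(labels, key=to_key)  # compared as (year, week)

print(as_strings)
print(as_tuples)
```

Supplying such tuples as col_key means the sorted column headings come out in calendar order.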

    import sys


    class CrossTab(object):

        # missing: what to return for an empty cell.
        # Alternatives: '', 0.0, None, 'NULL'
        def __init__(self, missing=0):
            self.missing = missing
            self.col_key_set = set()
            self.cell_dict = {}
            self.headings_OK = False

        def add_item(self, row_key, col_key, value):
            self.col_key_set.add(col_key)
            try:
                # Built-in "group by": accumulate into an existing cell.
                self.cell_dict[row_key][col_key] += value
            except KeyError:
                try:
                    # Row exists, but this column is new for it.
                    self.cell_dict[row_key][col_key] = value
                except KeyError:
                    # First value seen for this row.
                    self.cell_dict[row_key] = {col_key: value}

        def _process_headings(self):
            if self.headings_OK:
                return
            self.row_headings = sorted(self.cell_dict)
            self.col_headings = sorted(self.col_key_set)
            self.headings_OK = True

        def get_col_headings(self):
            self._process_headings()
            return self.col_headings

        def generate_row_info(self):
            self._process_headings()
            for row_key in self.row_headings:
                row_dict = self.cell_dict[row_key]
                row_vals = [row_dict.get(col_key, self.missing)
                            for col_key in self.col_headings]
                yield row_key, row_vals

        def dump(self, f=None, header=None, footer=None):
            # Debugging aid: print every attribute of the instance.
            if f is None:
                f = sys.stdout
            alist = sorted(self.__dict__.items())
            if header is not None:
                print(header, file=f)
            for attr, value in alist:
                print("%s: %r" % (attr, value), file=f)
            if footer is not None:
                print(footer, file=f)


    if __name__ == "__main__":

        data = [
            ['Rob', 'Morn', 240],
            ['Rob', 'Aft', 300],
            ['Joe', 'Morn', 70],
            ['Joe', 'Aft', 80],
            ['Jill', 'Morn', 100],
            ['Jill', 'Aft', 150],
            ['Rob', 'Aft', 40],
            ['Rob', 'aft', 5],
            ['Dozy', 'Aft', 1],  # Dozy doesn't show up till lunch-time
            ['Nemo', 'never', -1],
        ]
        NAME, TIME, AMOUNT = range(3)
        xlate_time = {'morn': "AM", "aft": "PM"}

        print()
        ctab = CrossTab(missing=None)
        # ctab.dump(header='=== after init ===')
        for s in data:
            ctab.add_item(
                row_key=s[NAME],
                col_key=xlate_time.get(s[TIME].lower(), "XXXX"),
                value=s[AMOUNT])
            # ctab.dump(header='=== after add_item ===')
        print(ctab.get_col_headings())
        # ctab.dump(header='=== after get_col_headings ===')
        for x in ctab.generate_row_info():
            print(x)

Output:

    ['AM', 'PM', 'XXXX']
    ('Dozy', [None, 1, None])
    ('Jill', [100, 150, None])
    ('Joe', [70, 80, None])
    ('Nemo', [None, None, -1])
    ('Rob', [240, 345, None]) |
|