map for fast retreive and sort

7月 10 2014

之前写了几个map 算法，发现map比hash_map unordered_map都快，如果hash_map是因为正如群里所说的微软设计的一个比较挫的hash class，
那unordered_map慢就说不过去了吧，这可是STL，而且一开始以为unordered_set，这些都是C++ 11加进来的，但是set只能一个数据类型，我应用的时候
经常需要统计一个word的count就是一个pair了，所以还是unordered_map比较实用，用法和map一样，find和end()比，然后插入直接map[]=… 或者map.insert(make_pair(strng,int))这种的。
至于为啥慢我也不知道了。

另外还有unordered_multiset, unordered_multimap 是指key可以重复，对于multimap表示 , 对于multiset表示apple, apple 也即key可以重复的这种，一般很少用到

另外还有一个问题就是我需要对count进行排序咋办？

sort一般是对于里面容器是单个对象的，如果map不可以的，网上一个好的解决方案是赋给一个vector，里面是pair，然后根据pair的第二个成员 count来比较确定顺序，没有其他好的算法了，
unordered_map这种连key都是乱得，不然怎么交hash为杂凑：），map的key则是红黑树(有序二叉树)，是有序的,但我们需要的是count排序。
http://stackoverflow.com/questions/1367429/sorting-a-stdmap-by-value-before-output-destroy

想一个简单的例子，百度很多日志，很多query，每个query是一个关键词string，外加其他属性例如IP， userid之类的，我们统计top10 query，那么就需要用这个了，查询高效，n次hash，O(n),
然后进行heapsort，O(n+10logn)=O(n)，所以这里面先放到一个vector，然后根据value(count)进行建堆和调堆10次。这个算法在hash函数合理的情况下保证线性的时间复杂度，但是实际拿小数据测出来都
不如二叉树的查找效率#_#

这道题目和July著名博客的hash那篇的第一个例子很像，戳->http://blog.csdn.net/v_july_v/article/details/6256463

附上代码：

#include <iostream>
#include <sstream>
#include <fstream>
#include <vector>
#include <map>
#include <hash_map>
#include <string>
#include <ctime>
#include <unordered_map>
#include <algorithm>
using namespace std;

void hashmap_wordcount()
{
    ifstream fin("E:\\2013-03-08-NTUInternship\\MetaSLForHuman\\Data\\WuMinCollectedFunctionalOrtholog\\HumanSL-YG-FunctionalSim.txt");
    if(fin==NULL) return ;
    string line,splitword;
    hash_map<string,int> wordcount; 
    while(getline(fin,line,'\n'))
    {
        istringstream istr(line);
        while(istr>>splitword)
        {
            //int splitnum=atoi(splitword.c_str());
            hash_map<string,int>::iterator it=wordcount.begin();
            if((it=wordcount.find(splitword))!=wordcount.end())
                //wordcount[splitword]++;
                it->second++;
            else
                //wordcount[splitword]=1; 
                wordcount.insert(make_pair<string,int>(splitword,1));
        }
    }
    /*
    for(hash_map<string,int>::iterator it=wordcount.begin();it!=wordcount.end();it++)
    {
        cout<<it->first<<" "<<it->second<<endl;
    }*/
}
void map_wordcount()
{
    ifstream fin("E:\\2013-03-08-NTUInternship\\MetaSLForHuman\\Data\\WuMinCollectedFunctionalOrtholog\\HumanSL-YG-FunctionalSim.txt");
    if(fin==NULL) return ;
    string line,splitword;
    map<string,int> wordcount; 
    while(getline(fin,line,'\n'))
    {
        istringstream istr(line);
        while(istr>>splitword)
        {
            //int splitnum=atoi(splitword.c_str());
            map<string,int>::iterator it=wordcount.begin();
            if((it=wordcount.find(splitword))!=wordcount.end())
                it->second++;
            else
                //wordcount[splitword]=1;
                wordcount.insert(make_pair<string,int>(splitword,1));
        }
    }/*
    for(map<string,int>::iterator it=wordcount.begin();it!=wordcount.end();it++)
    {
        cout<<it->first<<" "<<it->second<<endl;
    }*/
}
bool SortComp(const pair<string, int> p1, const pair<string, int>p2)
{
    return p1.second<p2.second;
}
void unordered_map_wordcount()
{
    unordered_map<string, int> wordcount;
    ifstream infile("E:\\2013-03-08-NTUInternship\\MetaSLForHuman\\Data\\WuMinCollectedFunctionalOrtholog\\HumanSL-YG-FunctionalSim.txt");
    string line; string splitword;
    while(getline(infile,line))
    {
        istringstream istr(line);
        while(istr>>splitword)//space or tab split default
        {
            if(wordcount.find(splitword)!=wordcount.end())//found
            {
                wordcount[splitword]++;
            }
            else
            {
                wordcount[splitword]=1;
            }
        }
    }
    vector<pair<string,int>> wordcountvec(wordcount.begin(),wordcount.end());

    sort(wordcountvec.begin(),wordcountvec.end(),SortComp);

    /*
    //find maxcount;
    unordered_map<string, int>::iterator it=wordcount.begin(),maxit;
    int maxcount=it->second;
    it++;
    while(it!=wordcount.end())
    {
        if(it->second>maxcount)
            maxit=it;
        //cout<<it->first<<" "<<it->second<<endl;
        it++;
    }
    */
}


int main()
{
    clock_t start,finish;
    start=clock();
    map_wordcount();
    finish=clock();
    cout<<(double)(finish-start)/CLOCKS_PER_SEC<<endl;
    start=clock();
    hashmap_wordcount();
    finish=clock();
    cout<<(double)(finish-start)/CLOCKS_PER_SEC<<endl;

    start=clock();
    unordered_map_wordcount();
    finish=clock();
    cout<<(double)(finish-start)/CLOCKS_PER_SEC<<endl;

    return 0;
}

Posted by richard爱闹 - 7月 10 2014
如需转载，请注明：本文来自 Richard

留言