
It's late at night and I still haven't rested. Spring Festival is almost here, yet there's still so much overtime. My heart is tired; I'd love to find somewhere to rest and unwind...
But the road must still be walked and life goes on. Splash some water on my face, and back to the fight. That's just life...
1. Inverted Index
An inverted index is a data structure built to make search faster. Without one, a search engine would have to scan its collection at query time for every page containing your keywords, which takes a long time. If instead the engine already knows, before any query arrives, which pages each keyword appears in, lookups become much faster.
This article implements an inverted index with MapReduce.
Sample input data:
Facebook is very good and very nice
Google is very good too
Tencent is very good too
Expected output:
Facebook line 1, index position 0 :1;
Google line 2, index position 0 :1;
Tencent line 3, index position 0 :1;
and line 1, index position 4 :1;
good line 1, index position 3 :1;line 2, index position 3 :1;line 3, index position 3 :1;
is line 1, index position 1 :1;line 2, index position 1 :1;line 3, index position 1 :1;
nice line 1, index position 6 :1;
too line 2, index position 4 :1;line 3, index position 4 :1;
very line 1, index position 2 :1;line 1, index position 5 :1;line 2, index position 2 :1;line 3, index position 2 :1;
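Before the MapReduce version, the same word-to-locations mapping can be reproduced with a minimal in-memory sketch (the class and method names here are illustrative, not part of the project):

```java
import java.util.*;

// Minimal in-memory inverted index: for each word we record
// "(line, position)" postings, so looking up a word returns the
// places it occurs without rescanning the whole corpus.
public class SimpleInvertedIndex {
    private final Map<String, List<String>> postings = new TreeMap<>();

    public void addLine(int lineNo, String line) {
        String[] words = line.split("\\s+");
        for (int pos = 0; pos < words.length; pos++) {
            // Append this occurrence to the word's posting list
            postings.computeIfAbsent(words[pos], w -> new ArrayList<>())
                    .add("line " + lineNo + ", index " + pos);
        }
    }

    public List<String> lookup(String word) {
        return postings.getOrDefault(word, Collections.emptyList());
    }

    public static void main(String[] args) {
        SimpleInvertedIndex idx = new SimpleInvertedIndex();
        idx.addLine(1, "Facebook is very good and very nice");
        idx.addLine(2, "Google is very good too");
        idx.addLine(3, "Tencent is very good too");
        System.out.println(idx.lookup("good"));
    }
}
```

The MapReduce job below produces the same information, but distributes the "build the posting lists" work across mappers and reducers.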
The overall project structure:
InvertedIndexKeyValue: holds the intermediate key-value pair passed between map and reduce (here only the map-side intermediate pair is implemented)
InvertedIndexMain: the entry point of the project; configures the map and reduce classes
InvertedIndexMapper: the map phase of the job
InvertedIndexReducer: the reduce phase of the job

Each class in turn:
InvertedIndexKeyValue: a small data structure carrying a human-readable location description, used for the map's intermediate results; reduce-side intermediates could be added here too, as needed;
package InvertedIndex;

public class InvertedIndexKeyValue {
    // Holds one intermediate <key, value> pair of the inverted-index job
    private String keywords; // key of the map's intermediate output
    private String describe; // value of the map's intermediate output

    public void setKeywords(String keywords) {
        this.keywords = keywords;
    }

    public void setDescribe(String describe) {
        this.describe = describe;
    }

    public String getKeywords() {
        return keywords;
    }

    public String getDescribe() {
        return describe;
    }
}
InvertedIndexMapper: implements the map phase. For every word on every line of the input it emits a <key, value> pair, where the key has two parts: the word itself plus the location (line and position) where it occurs;
package InvertedIndex;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
import java.net.URI;
import java.util.StringTokenizer;

public class InvertedIndexMapper {
    public static InvertedIndexKeyValue AnalysisKeyWord(String keyword, String i, String j) {
        // Build one intermediate pair: key = word + its location, value = "1"
        InvertedIndexKeyValue invertedIndexKeyValue = new InvertedIndexKeyValue();
        invertedIndexKeyValue.setKeywords(keyword + ":" + "line " + i + ", index position " + j + " ");
        invertedIndexKeyValue.setDescribe("1");
        return invertedIndexKeyValue;
    }

    public static void delfile(Configuration conf, Path path) throws Exception {
        // Delete the output directory before each run so the job can start cleanly
        FileSystem fs = FileSystem.get(new URI(path.toString()), conf);
        if (fs.exists(path)) {
            fs.delete(path, true);
        }
    }

    public static class MapIndex extends Mapper<LongWritable, Text, Text, Text> {
        private Text valueof = new Text();
        private Text keyof = new Text();
        // Line counter; note this is only reliable while a single mapper
        // reads the whole file in order (one small input split)
        private int i = 1;

        @Override
        protected void map(LongWritable key, Text text, Context context) throws IOException, InterruptedException {
            StringTokenizer stringTokenizer = new StringTokenizer(text.toString());
            int index = 0; // position of the current word within this line
            while (stringTokenizer.hasMoreTokens()) {
                // Emit one <word:location, "1"> pair per word on this line
                InvertedIndexKeyValue invertedIndexKeyValue = InvertedIndexMapper.AnalysisKeyWord(
                        stringTokenizer.nextToken(), String.valueOf(i), String.valueOf(index));
                keyof.set(invertedIndexKeyValue.getKeywords());
                valueof.set(invertedIndexKeyValue.getDescribe());
                context.write(keyof, valueof);
                index++;
            }
            i += 1;
        }
    }
}
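To make the map output concrete, here is a hypothetical stand-alone sketch (not part of the project) that mirrors the mapper's tokenization and key construction for one input line:

```java
import java.util.*;

public class MapSketch {
    // Mirrors the mapper: key = word + ":line N, index position P",
    // value = "1"; key and value are joined with a tab for display.
    public static List<String> emitsFor(int lineNo, String line) {
        List<String> pairs = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line); // same whitespace split as the mapper
        int index = 0;
        while (st.hasMoreTokens()) {
            pairs.add(st.nextToken() + ":line " + lineNo + ", index position " + index + "\t1");
            index++;
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Walk through the first sample line from the article
        emitsFor(1, "Facebook is very good and very nice").forEach(System.out::println);
    }
}
```

For line 1 this emits seven pairs, one per word, including two distinct keys for "very" because it occurs at positions 2 and 5.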
InvertedIndexReducer: this file consists of two classes, Combine and ReduceIndex, which together perform the reduce side of the job;
Combine: the map output key has the form Facebook:line 1, index position 0 and the value is 1. The combiner splits the key at the first ':', sums the counts for that key, and moves the location part into the value together with the count. (Note that because the key also contains the word's position, each key occurs only once, so the sum here is always 1; dropping the position from the key would make the sum a true per-line term frequency.)
Reduce: groups the combiner output by word and concatenates all the location descriptions for each word into a single value.
package InvertedIndex;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;

public class InvertedIndexReducer {
    // Completes the aggregation of word occurrences
    public static class ReduceIndex extends Reducer<Text, Text, Text, Text> {
        private Text keyindex = new Text();
        private Text valueindex = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> value, Context context) throws IOException, InterruptedException {
            // Concatenate every location description for this word, separated by ';'
            StringBuilder stringBuilder = new StringBuilder();
            for (Text va : value) {
                stringBuilder.append(va.toString()).append(";");
            }
            keyindex.set(key);
            valueindex.set(stringBuilder.toString());
            context.write(keyindex, valueindex);
        }
    }

    public static class Combine extends Reducer<Text, Text, Text, Text> {
        private Text keyinfo = new Text();
        private Text valueinfo = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> value, Context context) throws IOException, InterruptedException {
            // Sum the per-occurrence counts for this key
            int sum = 0;
            for (Text text : value) {
                sum += Integer.parseInt(text.toString());
            }
            // Rebuild key and value: sum becomes part of the value, and the key is
            // split so that the word alone is the key and the location moves to the value
            int spiltindex = key.toString().indexOf(":");
            keyinfo.set(key.toString().substring(0, spiltindex));
            valueinfo.set(key.toString().substring(spiltindex + 1) + ":" + sum);
            context.write(keyinfo, valueinfo);
        }
    }
}
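The Combine transformation above can be simulated in isolation. This hypothetical sketch (names are illustrative) reproduces the two steps: summing the "1" values and splitting the key at the first ':' so the word becomes the new key:

```java
import java.util.*;

public class CombineSketch {
    // Returns {newKey, newValue}: the word alone becomes the key, and
    // the location string plus ":sum" becomes the value.
    public static String[] combine(String key, List<String> values) {
        int sum = 0;
        for (String v : values) {
            sum += Integer.parseInt(v); // accumulate the per-occurrence counts
        }
        int split = key.indexOf(':'); // first ':' separates the word from its location
        return new String[] { key.substring(0, split), key.substring(split + 1) + ":" + sum };
    }

    public static void main(String[] args) {
        String[] out = combine("very:line 1, index position 2 ", Arrays.asList("1"));
        System.out.println(out[0] + "\t" + out[1]);
    }
}
```

This is exactly the shape the reducer then sees: all values for the same word are concatenated into one line of output.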
InvertedIndexMain: wires the classes together and launches the job
package InvertedIndex;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndexMain {
    // Entry point of the inverted-index job
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path1 = new Path("InvertedIndex/InvertedIndexFile");
        //Path path2 = new Path(args[1]);
        Path path2 = new Path("outputInvertedIndex");
        InvertedIndexMapper.delfile(conf, path2); // clear the output directory from any previous run
        Job job = Job.getInstance(conf, "InvertedIndex");
        FileInputFormat.setInputPaths(job, path1);
        FileOutputFormat.setOutputPath(job, path2);
        job.setJarByClass(InvertedIndexMain.class);
        job.setMapperClass(InvertedIndexMapper.MapIndex.class);
        job.setCombinerClass(InvertedIndexReducer.Combine.class);
        job.setReducerClass(InvertedIndexReducer.ReduceIndex.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        // map side configured
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

PS: for setting up a local Hadoop environment, see my previous post!!!
After overtime, I stayed up late to write down today's learning notes. Tired, but worth it. Keep going!!!