Daily Report 2012/11/07 陈伯雄(step 8)
Published: 2019-06-14


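Today, following the PIPE group's changes to the data tables, I reworked the code that builds the inverted index. The DOC, VEDIO and QUESTION tables (QUESTION was renamed from QAPAIR) do not share exactly the same attributes, so the data-processing method needed a few small changes:

Attribute shared by the DOC and VEDIO tables: title;

Attributes unique to DOC: author, keywords;

Attribute unique to QUESTION: question;

The inverted indexes produced for the three tables have the same structure: a WORDLIST and the corresponding IDs;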
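
As a rough illustration of the attribute list above, the sketch below records which fields of each table are fed into the index and picks those values out of one row. The `TableFieldsSketch` class, the `ValuesToIndex` method and the dictionary-based row are hypothetical and are not part of the original project, which reads rows through a `SqlDataReader`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration, not the project's actual code.
public static class TableFieldsSketch
{
    // Fields indexed for each table, taken from the attribute list above.
    public static readonly Dictionary<string, string[]> IndexedFields =
        new Dictionary<string, string[]>
        {
            { "DOC", new[] { "title", "author", "keywords" } },
            { "VEDIO", new[] { "title" } },
            { "QUESTION", new[] { "question" } },
        };

    // Returns the values of the indexed fields for one row of the given table.
    // Fields that are missing from the row are skipped.
    public static List<string> ValuesToIndex(string table, Dictionary<string, string> row)
    {
        var values = new List<string>();
        foreach (string field in IndexedFields[table])
        {
            string value;
            if (row.TryGetValue(field, out value))
            {
                values.Add(value);
            }
        }
        return values;
    }
}
```

Whatever the table, the selected values are then turned into words and stored as word-to-ID entries, which is why the three indexes end up with the same shape.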

The following functionality has been integrated into the word-segmentation module and the inverted-index update module:

    using System;
    using System.Collections.Generic;
    using System.Data.SqlClient;
    using System.Linq;

    // Word segmentation: type 0 = DOC, type 1 = VEDIO, anything else = QUESTION
    static private List<string> getWords(int type, SqlDataReader reader)
    {
        List<string> listall = new List<string>();
        if (type == 0)
        {
            string title = reader[_Title].ToString();
            string keyword = reader[_KeyWords].ToString();
            string author = reader[_Author].ToString();
            //string description = reader[_Description].ToString();
            List<string> list1 = ChineseWordSegmentation.word_segmentation(title);
            // Keywords and authors are split on fixed separators rather than segmented
            List<string> list2 = keyword.Split(new char[2] { ' ', ':' }, StringSplitOptions.RemoveEmptyEntries).ToList();
            List<string> list3 = author.Split(new char[2] { ' ', '.' }, StringSplitOptions.RemoveEmptyEntries).ToList();
            //List<string> list4 = ChineseWordSegmentation.word_segmentation(description);
            //listall = list1.Union(list2).Union(list3).Union(list4).ToList();
            listall = list1.Union(list2).Union(list3).ToList();
        }
        else if (type == 1)
        {
            string title = reader[_Title].ToString();
            //string description = reader[_Description].ToString();
            //List<string> list1 = ChineseWordSegmentation.word_segmentation(title);
            //List<string> list2 = ChineseWordSegmentation.word_segmentation(description);
            //listall = list1.Union(list2).ToList();
            listall = ChineseWordSegmentation.word_segmentation(title);
        }
        else
        {
            string question = reader[_Question].ToString();
            listall = ChineseWordSegmentation.word_segmentation(question);
        }
        return listall;
    }

    // Update the inverted index.
    // The table name index3 and its key/value columns are kept from the original code;
    // the main function creates tables named index_doc, index_vedio and index_question
    // with columns key and ID.
    static private void updateIndex(List<string> words, SqlConnection connection, string ID)
    {
        SqlCommand cmd = new SqlCommand();
        cmd.Connection = connection;
        foreach (string word in words)
        {
            // Pass the word as a parameter instead of writing the literal text "word" into the SQL
            cmd.Parameters.Clear();
            cmd.Parameters.AddWithValue("@word", word);
            // "key" is a reserved word in T-SQL, so it is bracketed
            cmd.CommandText = "SELECT value FROM index3 WHERE [key] = @word";
            object val = cmd.ExecuteScalar();
            // ExecuteScalar returns null (not DBNull) when no row matches
            if (val == null)
            {
                // New keyword: add it to the inverted table
                cmd.Parameters.AddWithValue("@id", ID);
                cmd.CommandText = "INSERT INTO index3 VALUES(@word, @id)";
                cmd.ExecuteNonQuery();
            }
            else
            {
                // Keyword already in the inverted index: append this row's ID
                string newValue = val.ToString() + "," + ID;
                cmd.Parameters.AddWithValue("@newValue", newValue);
                cmd.CommandText = "UPDATE index3 SET value = @newValue WHERE [key] = @word";
                cmd.ExecuteNonQuery();
            }
        }
    }
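
The DOC branch of getWords splits the keyword and author strings on fixed separator characters and merges the resulting lists with `Union`, which drops repeated words. The standalone sketch below isolates just that step so it can be tried without a database or the segmenter. The `SplitSketch` class, the `SplitAndMerge` method and the sample strings are hypothetical; `ChineseWordSegmentation` is not reproduced here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration of the Split + Union step only.
public static class SplitSketch
{
    // Splits keywords on ' ' and ':' and the author string on ' ' and '.',
    // then merges the two lists without duplicates.
    public static List<string> SplitAndMerge(string keywords, string author)
    {
        List<string> keywordList = keywords
            .Split(new char[] { ' ', ':' }, StringSplitOptions.RemoveEmptyEntries)
            .ToList();
        List<string> authorList = author
            .Split(new char[] { ' ', '.' }, StringSplitOptions.RemoveEmptyEntries)
            .ToList();
        // Union yields the first list's items, then unseen items of the second list.
        return keywordList.Union(authorList).ToList();
    }
}
```

For example, `SplitSketch.SplitAndMerge("inverted index:search", "J. Smith search")` yields `inverted`, `index`, `search`, `J`, `Smith`: the empty piece between "." and the space is discarded by `RemoveEmptyEntries`, and the second "search" is removed by `Union`.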

Main function:

    List<string> resultList = new List<string>();
    string connectionString = GetConnectionString(); // SQL Server connection string
    // One connection for reading rows, and a second one for index updates:
    // a connection with an open SqlDataReader cannot run other commands
    // unless MARS is enabled.
    using (SqlConnection connection = new SqlConnection(connectionString))
    using (SqlConnection updateConnection = new SqlConnection(connectionString))
    {
        connection.Open(); // open the database
        updateConnection.Open();
        // Create the inverted tables ("key" is a reserved word in T-SQL, so it is bracketed)
        string sqlstr = "CREATE table index_doc([key] varchar(50) primary key, ID varchar(50))";
        SqlCommand cmd = new SqlCommand();
        cmd.Connection = connection;
        cmd.CommandText = sqlstr;
        cmd.ExecuteNonQuery();
        sqlstr = "CREATE table index_vedio([key] varchar(50) primary key, ID varchar(50))";
        cmd.CommandText = sqlstr;
        cmd.ExecuteNonQuery();
        sqlstr = "CREATE table index_question([key] varchar(50) primary key, ID varchar(50))";
        cmd.CommandText = sqlstr;
        cmd.ExecuteNonQuery();

        for (int i = 0; i < 3; i++)
        {
            string table = "";
            if (i == 0) table = _TableDoc;
            else if (i == 1) table = _TableVideo;
            else table = _TableQuestion;
            // Read the source table row by row (note the space after FROM)
            sqlstr = "SELECT * FROM " + table;
            cmd.CommandText = sqlstr;
            SqlDataReader reader = cmd.ExecuteReader();
            try
            {
                while (reader.Read())
                {
                    string ID = reader[_ID].ToString();
                    // Word segmentation
                    List<string> words = getWords(i, reader);
                    // Add the keyword information to the inverted table
                    updateIndex(words, updateConnection, ID);
                }
            }
            finally
            {
                // Always call Close when done reading.
                reader.Close();
            }
        }
    }
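
Because updateIndex is called once per row, the ID stored for each keyword grows into a comma-separated list of row IDs. The sketch below mimics that insert-or-append rule with a `Dictionary` standing in for the SQL table, so the behaviour can be checked without a database. `PostingListSketch` and `Update` are hypothetical names, and this is only an in-memory approximation of the SQL logic, not the project's code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for the inverted table.
public static class PostingListSketch
{
    // New words are stored with the row ID; known words get ",ID" appended.
    public static void Update(Dictionary<string, string> index, List<string> words, string id)
    {
        foreach (string word in words)
        {
            string existing;
            if (index.TryGetValue(word, out existing))
            {
                index[word] = existing + "," + id;
            }
            else
            {
                index[word] = id;
            }
        }
    }
}
```

Calling `Update` with the words `index`, `search` for row "1" and then with `search` for row "2" leaves `index["search"]` equal to `"1,2"` and `index["index"]` equal to `"1"`.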

Reposted from: https://www.cnblogs.com/DOOM-scse/archive/2012/11/07/2759674.html
