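ASP.NET Tips: A Brief Look at a Data Collection Program
First, a bit of terminology. A "data collection program" is what is often called a web page thief, that is, a scraper (please don't flame me). I have finished writing one and am posting my notes here; comments and suggestions are welcome so we can work through the open questions together.
Step one: before any data can be downloaded, some sites require a login before they show the relevant pages, so the login user name and password have to be sent first. I did manage to log in, but the server is no pushover: it redirects after login and creates two sessions in total, and I could not work out how to capture the second one. So I took a shortcut ^-^: I captured the session cookie with a packet sniffer called Ethereal (since renamed Wireshark) and added it to the HTTP request headers with the following code: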
- WebClient myWebClient = new WebClient();
- string sessionkey=textBox78.Text;
- string refererurl=textBox77.Text;
- myWebClient.Headers.Clear();
- myWebClient.Headers.Add("Cookie",sessionkey);
- myWebClient.Headers.Add("Referer", refererurl);
- myWebClient.Headers.Add("User-agent", "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5) Gecko/20031107 Debian/1.5-3");
That is enough to fool the server, haha.
Step two: download the page source.
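As an alternative to sniffing the session by hand, the second session cookie can usually be captured in code by attaching a CookieContainer to an HttpWebRequest and letting it follow the redirect. The following is only a sketch under assumptions: the class name SessionScraper, its method names, the form-encoded login body, and the UTF-8 response encoding are all made up for illustration, and the real login form of a given site may need different fields. On recent .NET versions WebRequest.Create is marked obsolete, although it still works.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper class; names are illustrative, not from the article.
public static class SessionScraper
{
    // Posts the login form. Cookies set by the server, including any set on
    // a redirect response, are stored in the returned container.
    public static CookieContainer Login(string loginUrl, string formBody)
    {
        var cookies = new CookieContainer();
        var request = (HttpWebRequest)WebRequest.Create(loginUrl);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = cookies;   // collect Set-Cookie values here
        request.AllowAutoRedirect = true;    // follow the post-login redirect

        byte[] body = Encoding.UTF8.GetBytes(formBody);
        request.ContentLength = body.Length;
        using (Stream stream = request.GetRequestStream())
        {
            stream.Write(body, 0, body.Length);
        }
        // The response body is not needed here; only the cookies matter.
        using (request.GetResponse()) { }
        return cookies;
    }

    // Downloads a page, reusing the cookies gathered by Login.
    // Assumes the page is UTF-8; adjust the encoding for the target site.
    public static string Download(string url, string referer, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = cookies;
        request.Referer = referer;
        request.UserAgent = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.5) Gecko/20031107 Debian/1.5-3";
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // Returns the Cookie header value the container would send to the given URL.
    public static string CookieHeaderFor(CookieContainer cookies, string url)
    {
        return cookies.GetCookieHeader(new Uri(url));
    }
}
```

CookieHeaderFor is handy for checking which session cookies were actually captured; its result can also be passed to the "Cookie" header of a WebClient, as in the code above.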
- byte[] myDataBuffer = myWebClient.DownloadData(remoteUri);
- download = Encoding.Default.GetString(myDataBuffer);
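Encoding.Default decodes with the local machine's code page, which may not match the page being downloaded. A possible refinement, sketched here with a hypothetical PageDecoder helper, is to read the charset from the Content-Type response header and fall back to UTF-8 (an assumption) when it is missing or unknown.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper; not part of the original program.
public static class PageDecoder
{
    // Decodes downloaded bytes using the charset named in a Content-Type
    // header value such as "text/html; charset=gb2312".
    public static string Decode(byte[] data, string contentType)
    {
        Encoding encoding = Encoding.UTF8; // assumed fallback
        if (!string.IsNullOrEmpty(contentType))
        {
            Match m = Regex.Match(contentType, @"charset\s*=\s*[""']?([\w-]+)",
                                  RegexOptions.IgnoreCase);
            if (m.Success)
            {
                try
                {
                    encoding = Encoding.GetEncoding(m.Groups[1].Value);
                }
                catch (ArgumentException)
                {
                    // Unknown or unsupported charset name: keep the fallback.
                }
            }
        }
        return encoding.GetString(data);
    }
}
```

With the WebClient above, the header value would come from myWebClient.ResponseHeaders["Content-Type"] after DownloadData returns, for example: download = PageDecoder.Decode(myDataBuffer, myWebClient.ResponseHeaders["Content-Type"]);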
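Step three: matching the data. I read the stream into a string, use IndexOf to find the positions of two key markers, and then pull out the text between them with Substring. I know this is clumsy, but regular expressions are hard (if anyone knows how, please give me some pointers). Once the matching is done, I run the resulting string through the following function to strip out the HTML: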
- private string StripHTML(string strHtml)
- {
- string [] aryReg ={
- @"<script[^>]*?>[\s\S]*?</script>",
- @"<(\/\s*)?!?((\w+:)?\w+)(\w+(\s*=?\s*(([""'])(\\[""'tbnr]|[^\7])*?\7|\w+)|.{0})|\s)*?(\/\s*)?>",
- @"([\r\n])[\s]+",
- @"&(quot|#34);",
- @"&(amp|#38);",
- @"&(lt|#60);",
- @"&(gt|#62);",
- @"&(nbsp|#160);",
- @"&(iexcl|#161);",
- @"&(cent|#162);",
- @"&(pound|#163);",
- @"&(copy|#169);",
- @"&#(\d+);",
- @"-->",
- @"<!--.*\n"
- };
- string [] aryRep = {
- "",
- "",
- "",
- "\"",
- "&",
- "<",
- ">",
- " ",
- "\xa1",//chr(161),
- "\xa2",//chr(162),
- "\xa3",//chr(163),
- "\xa9",//chr(169),
- "",
- "\r\n",
- ""
- };
- string strOutput=strHtml;
- for(int i = 0;i<aryReg.Length;i++)
- {
- Regex regex = new Regex(aryReg[i],RegexOptions.IgnoreCase );
- strOutput = regex.Replace(strOutput,aryRep[i]);
- }
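- // String.Replace returns a new string, so the result must be assigned back.
- strOutput = strOutput.Replace("<","");
- strOutput = strOutput.Replace(">","");
- strOutput = strOutput.Replace("\r\n","");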
- return strOutput;
- }
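The IndexOf/Substring extraction described in step three is not shown in the article, so here is one way it might look, next to a regular-expression version that does the same job. This is a sketch only: the Extractor class and both method names are hypothetical, and the markers are whatever literal text surrounds the value on the target page.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; names are illustrative.
public static class Extractor
{
    // IndexOf/Substring approach: returns the text between the first
    // startMarker and the next endMarker, or null if either is missing.
    public static string Between(string source, string startMarker, string endMarker)
    {
        int start = source.IndexOf(startMarker, StringComparison.Ordinal);
        if (start < 0) return null;
        start += startMarker.Length;
        int end = source.IndexOf(endMarker, start, StringComparison.Ordinal);
        if (end < 0) return null;
        return source.Substring(start, end - start);
    }

    // Regex approach: the markers are escaped so they match literally, and
    // the lazy group (.*?) captures as little as possible between them.
    // Singleline lets '.' match line breaks inside the captured text.
    public static string BetweenRegex(string source, string startMarker, string endMarker)
    {
        string pattern = Regex.Escape(startMarker) + "(.*?)" + Regex.Escape(endMarker);
        Match m = Regex.Match(source, pattern, RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

For plain literal markers the two versions return the same text; the regex form starts to pay off once the markers themselves vary, for example when a tag's attributes change from page to page.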
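After that comes storing the results in the database, which I assume everyone knows how to do. I still have one problem, though: when I wrote the data, an exception was thrown saying the field was too long to be written to the database. I am using Access; I will try SQL Server next.
That wraps up this introduction to writing a data collection program in ASP.NET; I hope it is of some help when you write your own.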
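The article does not say which column type was used, but one likely cause is that an Access Text (Short Text) column holds at most 255 characters, while scraped page text is often longer. Assuming that is the case, the column could be changed to Memo (Long Text), or the value could be checked before the insert. The guard below is a minimal sketch; the AccessText class and its members are hypothetical.

```csharp
using System;

// Hypothetical guard for values bound for an Access Text (Short Text) column.
public static class AccessText
{
    // Maximum number of characters an Access Text (Short Text) field can hold.
    public const int TextFieldLimit = 255;

    // True when the value can be stored without exceeding the limit.
    public static bool FitsTextField(string value)
    {
        return value == null || value.Length <= TextFieldLimit;
    }

    // Cuts the value down to the limit; shorter values are returned unchanged.
    public static string TruncateForTextField(string value)
    {
        if (value == null || value.Length <= TextFieldLimit) return value;
        return value.Substring(0, TextFieldLimit);
    }
}
```

Truncation loses data, so it only suits fields such as titles; for full article bodies a Memo (Long Text) column is the safer choice.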