Tiny Search Engine : Design and implementation - PowerPoint PPT Presentation

About This Presentation
Title:

Tiny Search Engine : Design and implementation

Description:

Title: Figure 15.1 A distributed multimedia system Author: George Coulouris Last modified by: YAN Hongfei Created Date: 6/18/2000 9:59:47 PM Document presentation format – PowerPoint PPT presentation

Slides: 39
Provided by: George693
Learn more at: http://net.pku.edu.cn


1
Tiny Search Engine: Design and implementation
  • YAN Hongfei
  • yhf@net.cs.pku.edu.cn
  • Network Group
  • Oct. 2003

2
Outline
  • analysis
  • which deals with the design requirements and
    overall architecture of a system
  • design
  • which translates a system architecture into
    programming constructs (such as interfaces,
    classes, and method descriptions)
  • and programming
  • which implements these programming constructs.

3
Defining System Requirements and Capabilities
  • Supports multi-threaded page crawling
  • Supports persistent HTTP connections
  • Supports a DNS cache
  • Supports IP block
  • Supports filtering of unreachable sites
  • Supports link parsing
  • Supports recursive crawling
  • Supports Tianwang-format output
  • Supports ISAM output
  • Supports retrieving a page by its URL
  • Supports searching for a keyword in the depot

4
Three main components of the Web
  • HyperText Markup Language
  • A language for specifying the contents and layout
    of pages
  • Uniform Resource Locators
  • Identify documents and other resources
  • A client-server architecture with HTTP
  • By which browsers and other clients fetch
    documents and other resources from web servers

5
HTML
<IMG SRC = "http://www.cdk3.net/WebExample/Images/earth.jpg">
<P> Welcome to Earth! Visitors may also be interested in taking a look at the
<A HREF = "http://www.cdk3.net/WebExample/moon.html">Moon</A>. </P>
(etcetera)
  • HTML text is stored in a file of a web server.
  • A browser retrieves the contents of this file
    from a web server.
  • The browser interprets the HTML text.
  • The server can infer the content type from the
    filename extension.

6
URL
Scheme: scheme-specific-location
e.g.
  mailto:joe@anISP.net
  ftp://ftp.downloadIt.com/software/aProg.exe
  http://net.pku.cn/
  • HTTP URLs are the most widely used
  • An HTTP URL has two main jobs to do:
  • To identify which web server maintains the
    resource
  • To identify which of the resources at that server
    is required

7
HTTP URLs
  • http://servername[:port][/pathNameOnServer][?arguments]
  • e.g.
  • http://www.cdk3.net/
  • http://www.w3c.org/Protocols/Activity.html
  • http://e.pku.cn/cgi-bin/allsearch?word=distributed+system

    Server DNS name   Pathname on server        Arguments
    ----------------  ------------------------  -----------------------
    www.cdk3.net      (default)                 (none)
    www.w3c.org       Protocols/Activity.html   (none)
    e.pku.cn          cgi-bin/allsearch         word=distributed+system
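
The decomposition in the table above can be sketched in C++ as follows. This is only an illustration under stated assumptions, not code from TSE: the names `HttpUrlParts` and `splitHttpUrl` are hypothetical, the URL is assumed to start with a lowercase `http://`, and a port, if present, is assumed to be a plain decimal number.

```cpp
#include <cassert>
#include <string>

// Hypothetical holder for the parts of an HTTP URL (not a TSE type).
struct HttpUrlParts {
    std::string server;     // server DNS name
    int port = 80;          // default HTTP port when none is given
    std::string path;       // pathname on server; empty means the default page
    std::string arguments;  // text after '?'; empty means none
};

// Splits "http://servername[:port][/pathNameOnServer][?arguments]".
// Returns false if the URL does not start with "http://" or has no server name.
bool splitHttpUrl(const std::string& url, HttpUrlParts& out) {
    const std::string prefix = "http://";
    if (url.compare(0, prefix.size(), prefix) != 0) return false;
    std::string rest = url.substr(prefix.size());

    // Arguments: everything after the first '?'.
    out.arguments.clear();
    std::string::size_type q = rest.find('?');
    if (q != std::string::npos) {
        out.arguments = rest.substr(q + 1);
        rest.erase(q);
    }

    // Pathname: everything after the first '/'.
    out.path.clear();
    std::string::size_type slash = rest.find('/');
    if (slash != std::string::npos) {
        out.path = rest.substr(slash + 1);
        rest.erase(slash);
    }
    assert(rest.find('/') == std::string::npos);  // only "server[:port]" is left

    // Port: decimal digits after ':' (std::stoi throws on non-numeric input).
    out.port = 80;
    std::string::size_type colon = rest.find(':');
    if (colon != std::string::npos) {
        out.port = std::stoi(rest.substr(colon + 1));
        rest.erase(colon);
    }
    out.server = rest;
    return !out.server.empty();
}
```

Calling `splitHttpUrl` on the third example URL yields the server `e.pku.cn`, the pathname `cgi-bin/allsearch`, and the arguments `word=distributed+system`, matching the last table row.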

8
HTTP
  • Defines the ways in which browsers and any other
    types of client interact with web servers
    (RFC2616)
  • Main features
  • Request-reply interaction
  • Content types. The strings that denote the type
    of content are called MIME types (RFC 2045, 2046)
  • One resource per request (HTTP version 1.0)
  • Simple access control

9
More features-services and dynamic pages
  • Dynamic content
  • Common Gateway Interface (CGI): a program that web
    servers run to generate content for their clients
  • Downloaded code
  • JavaScript
  • Applet

10
What do we need?
  • Intel x86/Linux (Red Hat Linux) platform
  • C++
  • .

Linus Torvalds
11
Get the homepage of PKU site
    webg@BigPc telnet www.pku.cn 80                (connect to port 80 of the server)
    Trying 162.105.129.12...                       (Telnet output)
    Connected to rock.pku.cn (162.105.129.12).     (Telnet output)
    Escape character is '^]'.                      (Telnet output)
    GET /                                          (our request)
    <html>                                         (response from the web server)
    <head>
    <title>...</title>
    ...
    </body>
    </html>
    Connection closed by foreign host.             (Telnet output)

12
Outline
  • analysis
  • which deals with the design requirements and
    overall architecture of a system
  • design
  • which translates a system architecture into
    programming constructs (such as interfaces,
    classes, and method descriptions)
  • and programming
  • which implements these programming constructs.

13
Defining system objects
  • URL
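  • <scheme>://<net_loc>/<path>;<params>?<query>#<fragment>
  • Every component except the scheme may be absent from a URL.
  • scheme ":"     scheme (protocol) name.
  • "//" net_loc   network location/host name and login information.
  • "/" path       URL path.
  • ";" params     object parameters.
  • "?" query      query information.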
  • Page
  • .

14
Class URL
  class CUrl {
  public:
      string m_sUrl;              // URL string
      enum url_scheme m_eScheme;  // URL scheme (protocol name)
      string m_sHost;             // host name
      int m_nPort;                // port number

      /* URL components (URL-quoted). */
      string m_sPath, m_sParams, m_sQuery, m_sFragment;

      /* Extracted path info (unquoted). */
      string m_sDir, m_sFile;

      /* Username and password (unquoted). */
      string m_sUser, m_sPasswd;

  public:
      CUrl();
      ~CUrl();
      bool ParseUrl( string strUrl );
  };

15
CUrl::CUrl()
  CUrl::CUrl()
  {
      this->m_sUrl = "";
      this->m_eScheme = SCHEME_INVALID;
      this->m_sHost = "";
      this->m_nPort = DEFAULT_HTTP_PORT;
      this->m_sPath = "";
      this->m_sParams = "";
      this->m_sQuery = "";
      this->m_sFragment = "";
      this->m_sDir = "";
      this->m_sFile = "";
      this->m_sUser = "";
      this->m_sPasswd = "";
  }

16
CUrl::ParseUrl
  bool CUrl::ParseUrl( string strUrl )
  {
      string::size_type idx;

      this->ParseScheme( strUrl.c_str( ) );
      if( this->m_eScheme != SCHEME_HTTP )
          return false;

      // get host name
      this->m_sHost = strUrl.substr(7);   // skip the leading "http://"
      idx = m_sHost.find('/');
      if( idx != string::npos )
          m_sHost = m_sHost.substr( 0, idx );

      this->m_sUrl = strUrl;
      return true;
  }
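
ParseUrl relies on ParseScheme, which the slides do not show. As a rough idea of what such a step might look like, the sketch below classifies a URL by its leading scheme, ignoring case. Everything here is an assumption made for illustration: `parseScheme` and `startsWithNoCase` are hypothetical free functions, and of the enum values only `SCHEME_HTTP` and `SCHEME_INVALID` appear in the slides (`SCHEME_FTP` is added for the example).

```cpp
#include <cassert>
#include <cctype>

// SCHEME_HTTP and SCHEME_INVALID appear in the slides; SCHEME_FTP is assumed.
enum url_scheme { SCHEME_HTTP, SCHEME_FTP, SCHEME_INVALID };

// Case-insensitive test: does `url` start with `prefix`?
static bool startsWithNoCase(const char* url, const char* prefix) {
    for (; *prefix != '\0'; ++url, ++prefix) {
        if (*url == '\0') return false;  // url is shorter than prefix
        if (std::tolower(static_cast<unsigned char>(*url)) !=
            std::tolower(static_cast<unsigned char>(*prefix))) {
            return false;
        }
    }
    return true;
}

// Hypothetical stand-in for CUrl::ParseScheme: returns the scheme
// instead of storing it in m_eScheme.
url_scheme parseScheme(const char* url) {
    assert(url != nullptr);
    if (startsWithNoCase(url, "http://")) return SCHEME_HTTP;
    if (startsWithNoCase(url, "ftp://")) return SCHEME_FTP;
    return SCHEME_INVALID;
}
```

Note that the `substr(7)` in ParseUrl is only safe once the scheme check has confirmed the seven-character `http://` prefix, which is why the scheme is parsed first.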

17
Defining system objects
  • URL
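  • <scheme>://<net_loc>/<path>;<params>?<query>#<fragment>
  • Every component except the scheme may be absent from a URL.
  • scheme ":"     scheme (protocol) name.
  • "//" net_loc   network location/host name and login information.
  • "/" path       URL path.
  • ";" params     object parameters.
  • "?" query      query information.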
  • Page
  • .

18
Class Page
  public:
      string m_sUrl;               string m_sLocation;
      string m_sHeader;            int m_nLenHeader;
      string m_sCharset;           string m_sContentEncoding;
      string m_sContentType;
      string m_sContent;           int m_nLenContent;
      string m_sContentLinkInfo;
      string m_sLinkInfo4SE;       int m_nLenLinkInfo4SE;
      string m_sLinkInfo4History;  int m_nLenLinkInfo4History;
      string m_sContentNoTags;
      int m_nRefLink4SENum;        int m_nRefLink4HistoryNum;
      enum page_type m_eType;
      RefLink4SE m_RefLink4SE[MAX_URL_REFERENCES];
      RefLink4History m_RefLink4History[MAX_URL_REFERENCES/2];
      map<string,string,less<string> > m_mapLink4SE;
      vector<string> m_vecLink4History;

19
Class Page continued
  public:
      CPage();
      CPage(string strUrl, string strLocation,
            char *header, char *body, int nLenBody);
      ~CPage();
      int GetCharset();
      int GetContentEncoding();
      int GetContentType();
      int GetContentLinkInfo();
      int GetLinkInfo4SE();
      int GetLinkInfo4History();
      void FindRefLink4SE();
      void FindRefLink4History();
  private:
      int NormallizeUrl(string strUrl);
      bool IsFilterLink(string plink);

20
Sockets used for streams
21
????????????????
  • DNS??
  • URL????,?????????
  • ?????????????
  • ???????????????????
  • ?????????,???????,???????
  • ???????????? ?CERNET???????????
  • ????????
  • ????connect???,?????????
  • ????,????

22
?????????????? 1/3
  • Function prototype
  • int HttpFetch(string strUrl, char **fileBuf, char
    **fileHeadBuf, char **location, int *nPSock)
  • Adapted from http://fetch.sourceforge.net: int
    http_fetch(const char *url_tmp, char **fileBuf)
  • ????,?????,??

23
??header?? 2/3
  • Function prototype
  • int HttpFetch(string strUrl, char **fileBuf, char
    **fileHeadBuf, char **location, int *nPSock)
  • e.g.
    HTTP/1.1 200 OK
    Date: Tue, 16 Sep 2003 14:19:15 GMT
    Server: Apache/2.0.40 (Red Hat Linux)
    Last-Modified: Tue, 16 Sep 2003 13:18:19 GMT
    ETag: "10f7a5-2c8e-375a5cc0"
    Accept-Ranges: bytes
    Content-Length: 11406
    Connection: close
    Content-Type: text/html; charset=GB2312
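
A header block like the example above is what methods such as GetCharset and GetContentType have to work from. The following sketch shows one way to split such a block into name/value fields and pull the charset out of the Content-Type value. It is an illustration only: `parseHeaderFields` and `charsetOf` are hypothetical names, not TSE functions, and quoted charset values are not handled.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <sstream>
#include <string>

static std::string toLower(std::string s) {
    for (char& c : s) c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return s;
}

static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    std::string::size_type b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::string::size_type e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Parses "Name: value" lines into a map keyed by the lower-cased name.
// The first line (the status line) is skipped.
std::map<std::string, std::string> parseHeaderFields(const std::string& header) {
    std::map<std::string, std::string> fields;
    std::istringstream in(header);
    std::string line;
    std::getline(in, line);  // skip "HTTP/1.1 200 OK"
    while (std::getline(in, line)) {
        std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) continue;  // blank or malformed line
        fields[toLower(trim(line.substr(0, colon)))] = trim(line.substr(colon + 1));
    }
    return fields;
}

// Returns the lower-cased charset from a Content-Type value, or "" if absent.
std::string charsetOf(const std::string& contentType) {
    const std::string lower = toLower(contentType);
    const std::string key = "charset=";
    std::string::size_type pos = lower.find(key);
    if (pos == std::string::npos) return "";
    std::string::size_type start = pos + key.size();
    assert(start <= lower.size());
    std::string::size_type end = lower.find(';', start);
    if (end == std::string::npos) return trim(lower.substr(start));
    return trim(lower.substr(start, end - start));
}
```

Splitting at the first colon only matters for values such as the Date field, which contain further colons.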

24
??body?? 3/3
  • Function prototype
  • int HttpFetch(string strUrl, char **fileBuf, char
    **fileHeadBuf, char **location, int *nPSock)
  • e.g.
    <html>
    <head>
    <meta http-equiv="Content-Language" content="zh-cn">
    <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
    <meta name="GENERATOR" content="Microsoft FrontPage 4.0">
    <meta name="ProgId" content="FrontPage.Editor.Document">
    <title>Computer Networks and Distributed System</title>
    </head>
  • .

25
??????????
  • LAN: latency 1-10 ms, bandwidth 10-1000 Mbps
  • Internet: latency 100-500 ms, bandwidth 0.010-2 Mbps
  • ?????????????,?????????????
  • ??????????????,???,?????????,
  • ???????????????Internet????????

26
????????????????????????Robot? 1/2
  • ?????
  • ??????????13KB
  • ??????100Mbps???????,???????????100,?????????(1.0
    e8b/s)/ (1500B8b/B) 8333????,??????8333???
  • ????????Internet????100Mbs,Internet???????
    50(???????80,???????????),??????????????4000??
  • ??n??????????,???????Robot??????4000/n?

27
????????????????????????Robot? 2/2
  • ???
  • ?????????????????,????CPU?????????,??CPU????????50
    ,???????????80,?????????,??????????
  • ?????????????100Mbps????,??????Internet????100Mbps
    (??????????,??????),???Robot????1000??
  • ?????Robot???????????(http//e.pku.cn/ )?????

28
???????
  • ??????????????????461500?????
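  • Assume a round-trip time (RTT) of 200 ms and a server processing time (SPT) of 100 ms, so one TCP exchange takes about 500 ms (2 RTT + SPT).
  • Fetching one page then takes about (13 KB / 1500 B) × 500 ms ≈ 4 s.
  • With 100 robots running concurrently, one day yields (24 × 60 × 60 s / 4 s) × 100 = 2,160,000 pages.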
  • ???Robot???????????,??????????,???????????2,160,00
    0???/??

29
TSE??????
  • ??????????????URL??????
  • ????????????????
  • ??WWW?????,?????????TCP?????????,????TCP??????????
    ??
  • ???????????,??????,??????????????????(Denial of
    service)??????

30
????????????
  • ?????????URL??
  • ISAM????
  • ???????URL?,??WebData.idx?????????,???,????URL?
  • .md5.visitedurl
  • E.g. 0007e11f6732fffee6ee156d892dd57e
  • .unvisit.tmp
  • E.g.
  • http://dean.pku.edu.cn/zhaosheng/????2001??????????.files/filelist.xml
  • http://mba.pku.edu.cn/Chinese/xinwenzhongxin/xwzx.htm
  • http://mba.pku.edu.cn/paragraph.css
  • http://www.pku.org.cn/xyh/oversea.htm

31
???IP????
  • ??4???
  • ???,???,???,???????????????,
  • ????????????????
  • ???????
  • ???DNS??
  • ??????????????

32
ISAM
  • ???????isam??????,
  • ??????????(WebData.dat)
  • ???????(WebData.idx)
  • ??????????????????????url
  • Function prototype
  • int isamfile(char *buffer, int len)

33
Enumerate a page according to a URL
  • ????WebData.dat?WebData.idx????url????????????????
    ????
  • Function prototype
  • int FindUrl(char *url, char *buffer, int
    buffersize)

34
Search a key word in the depot
  • ??WebData.dat????????????,??????????????
  • Function prototype
  • void FindKey(char *key)
  • ????????url??key???????,??????????????????????????
    ???????

35
Tianwang format output
  • A raw page depot consists of records. Each
    record holds the raw data of one page. Records are
    stored sequentially, with no delimiter between
    them.
  • A record consists of a header (HEAD), the data (DATA),
    and a line feed ('\n'), that is: HEAD + blank
    line + DATA + '\n'.
  • A header consists of properties. Each
    property is a non-blank line; blank lines are
    forbidden inside the header.
  • A property consists of a name and a value,
    separated by ":".
  • The first property of the header must be the
    version property, such as version: 1.0
  • The last property of the header must be the
    length property, such as length: 1800
  • For simplicity, all property names should be
    in lowercase.
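
The record layout described above can be sketched as a writer and a reader. This is a minimal illustration, not TSE's implementation: `RawRecord`, `formatRecord`, and `readRecord` are hypothetical names, only the two mandatory properties are written, and `length` is assumed to be the byte count of DATA (which is what lets the reader find the end of a record when there is no delimiter).

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical in-memory form of one record.
struct RawRecord {
    std::map<std::string, std::string> properties;
    std::string data;
};

// Writes HEAD + blank line + DATA + '\n' with only version and length.
std::string formatRecord(const std::string& data) {
    std::ostringstream out;
    out << "version: 1.0\n";                   // first property
    out << "length: " << data.size() << "\n";  // last property
    out << "\n";                               // blank line ends the header
    out << data << "\n";
    return out.str();
}

// Reads one record; returns false at end of input or on a malformed record.
bool readRecord(std::istream& in, RawRecord& rec) {
    rec.properties.clear();
    rec.data.clear();
    std::string line, firstName, lastName;
    while (std::getline(in, line) && !line.empty()) {
        std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) return false;
        std::string name = line.substr(0, colon);
        std::string::size_type b = line.find_first_not_of(' ', colon + 1);
        rec.properties[name] = (b == std::string::npos) ? "" : line.substr(b);
        if (firstName.empty()) firstName = name;
        lastName = name;
    }
    if (firstName != "version" || lastName != "length") return false;

    // Parse the length by hand so that bad input returns false.
    const std::string& lenText = rec.properties["length"];
    if (lenText.empty()) return false;
    std::size_t length = 0;
    for (char c : lenText) {
        if (!std::isdigit(static_cast<unsigned char>(c))) return false;
        length = length * 10 + static_cast<std::size_t>(c - '0');
    }

    rec.data.resize(length);
    assert(rec.data.size() == length);
    if (length > 0) {
        in.read(&rec.data[0], static_cast<std::streamsize>(length));
        if (static_cast<std::size_t>(in.gcount()) != length) return false;
    }
    return in.get() == '\n';  // the line feed that closes the record
}
```

Because DATA is read by length rather than by scanning for a terminator, a page body may itself contain blank lines without confusing the reader.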

36
Summary
  • Supports multi-threaded page crawling
  • Supports persistent HTTP connections
  • Supports a DNS cache
  • Supports IP block
  • Supports filtering of unreachable sites
  • Supports link parsing
  • Supports recursive crawling
  • Supports Tianwang-format output
  • Supports ISAM output
  • Supports retrieving a page by its URL
  • Supports searching for a keyword in the depot

37
TSE package
  • http://net.pku.edu.cn/webg/src/TSE/
  • nohup ./Tse -c seed.pku
  • To stop the crawling process:
  • ps -ef
  • kill <process ID>

38
Thank you for your attention!