Web mining is the technique of automatically extracting and discovering information from web documents and services. Interest from various research communities, the tremendous growth of information resources on the Web and the recent interest in e-commerce have made this a very large area of research. Web mining can usually be decomposed into the following subtasks:
- Resource finding: fetching intended web documents.
- Information selection and pre-processing: selecting and preprocessing specific information from fetched web resources automatically.
- Generalization: automatically discovering general patterns within individual websites and across multiple websites.
- Analysis: validation and explanation of mined patterns.
Web mining can be divided into three main areas of interest according to which part of the Web is mined: Web content mining, Web structure mining and Web usage mining.
Web content mining is the discovery of useful information from web contents, data and documents. In the past, the internet consisted only of different types of services and data resources. Today a large share of data is available over the internet, and even digital libraries are available on the Web. Web contents consist of several types of data, including text, images, audio, video, metadata and hyperlinks. Many companies are trying to move their business and services into electronic form and put them on the Web, because the internet can connect to any other computer anywhere in the world and users increasingly access applications through web interfaces. As a result, company databases that previously resided on legacy systems are now accessible over the Web, so employees, business partners and even end clients can reach them online.
Some web contents are hidden and therefore cannot be indexed. Data generated dynamically from the results of database queries, as well as private data, can fall into this area. Web contents may be unstructured, such as free text; semi-structured, such as HTML; or fully structured, such as data in tables or database-generated web pages. Unstructured text, however, makes up most web content.
Work on web content mining is mostly done from two points of view: information retrieval (IR) and databases (DB). From the IR view, web content mining assists and improves information finding and filtering for the user. From the DB view, web content mining models the data on the Web and integrates it so that queries more sophisticated than keyword searches can be performed.
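To make the semi-structured case more concrete, the following is a minimal sketch of pulling the unstructured text and the hyperlinks out of an HTML page using only Python's standard `html.parser` module. The class name `ContentExtractor`, the helper `extract_content` and the sample page are hypothetical and chosen for illustration; a real content-mining system would need far more robust handling of malformed markup, encodings and dynamically generated pages.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collects visible text and hyperlink targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style content.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_content(html):
    """Return (plain_text, list_of_link_targets) for an HTML string."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


# Hypothetical sample page, used only for illustration.
sample = (
    "<html><head><title>Shop</title><style>p {color: red}</style></head>"
    "<body><h1>Digital Library</h1>"
    '<p>Browse our <a href="/catalog">catalog</a> online.</p>'
    "<script>var x = 1;</script></body></html>"
)

text, links = extract_content(sample)
print(text)
print(links)
```

The extracted text could then feed an IR-style step such as keyword indexing or filtering, while the tag structure and links remain available for DB-style modelling.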
Web structure mining is concerned with the structure of hyperlinks within the Web itself, which can be called the inter-document structure. It is closely related to web usage mining. Pattern detection and graph mining are central to web structure mining, and link analysis techniques can be used to find patterns in the link graph. Search engines such as Google make use of web structure mining. For example, once the links are mined, one can determine which web pages point to a particular page. When a string is searched, a page with the largest number of links pointing to it may appear first in the list. Web pages are therefore listed according to a rank, which is calculated from the ranks of the web pages that point to them.
Based on the kind of structural data used, web structure mining can be divided into two categories. The first deals with extracting patterns from the hyperlinks in the Web. A hyperlink is a structural component that connects a web page to a different web page or to a different location. The second deals with the document structure, using a tree-like structure to analyze and describe the HTML or XML tags within web pages.
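The link-based ranking idea described above can be sketched in a few lines. The example below counts incoming links and then runs a simplified PageRank-style power iteration, in which each page's rank is computed from the ranks of the pages that point to it. This is an illustrative sketch, not the ranking procedure of any actual search engine; the function names, the damping factor of 0.85, the iteration count and the four-page link graph are all assumptions made for the example.

```python
def in_link_counts(graph):
    """Count how many links point to each page."""
    counts = {page: 0 for page in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


def page_rank(graph, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration (damping value is an assumption)."""
    pages = set(graph)
    for targets in graph.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is shared equally by all pages.
        dangling = sum(rank[p] for p in pages if not graph.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for source in pages:
            targets = graph.get(source, [])
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: page -> pages it links to.
links = {
    "home": ["products", "about"],
    "products": ["home", "cart"],
    "about": ["home"],
    "cart": ["home"],
}

counts = in_link_counts(links)
ranks = page_rank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, counts[page], round(ranks[page], 3))
```

In this toy graph "home" receives the most incoming links and also ends up with the highest rank, but the two measures can disagree in general: a page with few incoming links can still rank highly if those links come from highly ranked pages.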
With the continuous growth of e-commerce, web services and web applications, the volume of clickstream and user data that web-based organizations collect in their daily operations has increased. Organizations can analyze such data to determine the lifetime value of clients, design cross-marketing strategies and so on. Web usage mining deals with the data generated by users' clickstreams. Web usage data includes web server access logs, proxy server logs, browser logs, user profiles, registration data, user sessions, transactions, cookies, user queries, bookmark data, mouse clicks and scrolls, and any other data that results from interaction. This makes web usage mining one of the most important tasks of web mining. Web log databases can provide rich information about web dynamics. In web usage mining, web log records are mined to discover user access patterns, through which potential customers can be identified, the quality of internet services can be enhanced and web server performance can be improved. Many techniques can be developed to implement web usage mining, but the success of such applications depends on what and how much valid and reliable knowledge can be discovered from the log data. Most often, web logs are cleaned, condensed and transformed before any useful and significant information is extracted from them. Web log records can then be mined to find association patterns, sequential patterns and trends in web access.
The overall web usage mining process can be divided into three interdependent stages: data collection and pre-processing, pattern discovery, and pattern analysis. In the data collection and pre-processing stage, the raw data is collected, cleaned and transformed into a set of user transactions that represent the activities of each user during visits to the website.
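As a rough illustration of the pre-processing stage, the sketch below parses web server access log lines, drops irrelevant requests and groups the remaining page views into per-user sessions. Several choices here are assumptions rather than anything prescribed by the text: the log is assumed to be in Common Log Format, users are identified by host address only, requests for static resources and failed requests are treated as noise, and a new session starts after 30 minutes of inactivity. The log lines themselves are invented for the example.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Common Log Format, e.g.:
# 10.0.0.1 - - [10/Oct/2023:13:00:00 +0000] "GET /index.html HTTP/1.1" 200 1024
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)
IGNORED_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico")
SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity threshold


def parse_line(line):
    """Return (host, timestamp, path) for a useful page view, else None."""
    match = LOG_PATTERN.match(line)
    if not match:
        return None
    path = match["path"]
    # Cleaning: keep successful GET requests for pages, not static resources.
    if match["method"] != "GET" or int(match["status"]) >= 400:
        return None
    if path.lower().endswith(IGNORED_SUFFIXES):
        return None
    when = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
    return match["host"], when, path


def build_sessions(lines):
    """Group cleaned page views into (host, [paths]) sessions."""
    by_host = defaultdict(list)
    for line in lines:
        record = parse_line(line)
        if record:
            host, when, path = record
            by_host[host].append((when, path))
    sessions = []
    for host, visits in by_host.items():
        visits.sort()
        current, last_time = [], None
        for when, path in visits:
            if last_time is not None and when - last_time > SESSION_TIMEOUT:
                sessions.append((host, current))
                current = []
            current.append(path)
            last_time = when
        if current:
            sessions.append((host, current))
    return sessions


# Invented log lines, for illustration only.
log_lines = [
    '10.0.0.1 - - [10/Oct/2023:13:00:00 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '10.0.0.1 - - [10/Oct/2023:13:00:01 +0000] "GET /style.css HTTP/1.1" 200 512',
    '10.0.0.1 - - [10/Oct/2023:13:05:00 +0000] "GET /products HTTP/1.1" 200 2048',
    '10.0.0.2 - - [10/Oct/2023:13:06:00 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '10.0.0.1 - - [10/Oct/2023:14:30:00 +0000] "GET /cart HTTP/1.1" 200 300',
    '10.0.0.2 - - [10/Oct/2023:13:07:00 +0000] "GET /missing HTTP/1.1" 404 0',
]

sessions = build_sessions(log_lines)
for host, pages in sessions:
    print(host, pages)
```

Identifying users by host address alone is a simplification, since proxies and shared addresses can merge several users into one. Real deployments often combine the address with cookies, user agents or login data, which the text lists among the available usage data.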
In the pattern discovery stage, statistical, database and machine learning operations are performed to uncover hidden patterns that represent the typical behavior of users, as well as summary statistics on Web resources, sessions and users.
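A very small instance of the pattern discovery stage is sketched below: it computes summary statistics over sessions (hits per page and average session length) and finds pairs of pages that frequently occur in the same session, a simple form of association pattern. The function names, the minimum support of 0.5 and the session data are hypothetical. Practical systems would typically use dedicated algorithms such as Apriori or sequential pattern mining rather than this exhaustive pair count.

```python
from collections import Counter
from itertools import combinations


def page_statistics(sessions):
    """Return (hits per page, average number of page views per session)."""
    hits = Counter(page for session in sessions for page in session)
    if not sessions:
        return hits, 0.0
    average_length = sum(len(session) for session in sessions) / len(sessions)
    return hits, average_length


def frequent_page_pairs(sessions, min_support=0.5):
    """Return page pairs seen together in at least min_support of sessions."""
    pair_counts = Counter()
    for session in sessions:
        # Count each pair at most once per session, in a fixed order.
        unique_pages = sorted(set(session))
        pair_counts.update(combinations(unique_pages, 2))
    total = len(sessions)
    return {
        pair: count / total
        for pair, count in pair_counts.items()
        if count / total >= min_support
    }


# Hypothetical sessions, each a list of visited pages.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart"],
]

hits, average_length = page_statistics(sessions)
pairs = frequent_page_pairs(sessions, min_support=0.5)
print(hits.most_common())
print(average_length)
print(pairs)
```

Here the support of a pair is the fraction of sessions that contain both pages, so raising `min_support` keeps only the more typical co-visits.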