Gaining Insight into User and Search Engine Behaviour by Analyzing Web Logs - Softcover

Jose, Dr. Jeeva

 
9783960670872: Gaining Insight into User and Search Engine Behaviour by Analyzing Web Logs

Synopsis

Web Usage Mining, also known as Web Log Mining, analyzes the data that result from user interaction with a Web server, including Web logs, click streams and database transactions, or from the visits of search engine crawlers to a Website. Log files provide an immense source of information about the behavior of users as well as search engine crawlers. Web Usage Mining concerns the discovery of common browsing patterns, i.e. pages requested in sequence, from Web logs. These patterns can be utilized to enhance the design and modification of a Website. Analyzing and discovering user behavior is helpful for understanding what online information users seek and how they behave. The analyzed results can be used in intelligent online applications, in refining Websites, in improving search accuracy when seeking information, and in leading decision makers towards better decisions in changing markets, for instance by placing advertisements in ideal places. Similarly, crawlers or spiders access Websites to index new and updated pages. These traces help to analyze the behavior of search engine crawlers.
The log files are unstructured and of huge size. They need to be extracted and pre-processed before any data mining can follow. Pre-processing is done in a different way for each application. Two pre-processing algorithms are proposed based on indiscernibility relations in rough set theory, which generate equivalence classes. The first algorithm generates a pre-processed file with successful user requests, while the second one generates a pre-processed file for pre-fetching and caching purposes. Two further algorithms are proposed to extract usage analytics. The first identifies the origin of visits, the top referring sites and the most popular keywords used by visitors to arrive at a Website. The second extracts user agents, i.e. the browsers and operating systems used by visitors to access a Website.
In this study, clustering of users based on Entry Pages to a Website is done to analyze the deep linked traffic at a Website. The Top Ten Entry Pages, the traffic and the temporal information of the Top Ten Entry Pages are also studied.

The synopsis may refer to a different edition of this title.

About the author

Prof. Jeeva Jose was awarded a PhD in Computer Science by Mahatma Gandhi University, Kerala, India, and is a faculty member at BPC College, Kerala. Her passion is teaching, and her areas of interest include the World Wide Web, Data Mining and Cyber laws. She has been in higher education since 2000 and has completed three research projects funded by UGC and KSCSTE. She has authored and published five books and more than twenty research papers in various refereed journals and conference proceedings. She has edited three books and has given many invited talks at various conferences. She is a recipient of the ACM-W Scholarship provided by the Association for Computing Machinery, New York.

Excerpt. © Reprinted by permission. All rights reserved.

Text sample:
Chapter 2: Pre-processing of Web Logs and Web Usage Analytics:
Web Usage Mining needs a tremendous amount of pre-processing before any data mining can follow. The pre-processing removes irrelevant records which otherwise may affect the mining results. This chapter is divided into two sections, namely pre-processing of Web logs and Web usage analytics. Two pre-processing algorithms are proposed based on indiscernibility relations in rough set theory, which generate equivalence classes. The first algorithm pre-processes the raw file for further identification of users and user sessions. The second algorithm pre-processes the log file and gives the pages accessed, their frequency and the total bytes transferred. Two algorithms are proposed to extract usage analytics. The first algorithm identifies the origin of user visits, the top referring sites and the most popular keywords used by visitors to arrive at a Website. The second algorithm extracts the browsers with their versions and the operating systems with their versions used by various visitors to access a Website. The browser and operating system are together known as user agents. All algorithms are tested on two different data sets and the results are displayed.
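As a rough illustration of the first usage analytics task (top referring sites and popular keywords), the following Python sketch counts referring hosts and search phrases from referrer URLs. It is not the algorithm proposed in the book; the sample URLs, the function name `summarize_referrers` and the assumption that the search phrase is carried in a `q` query parameter are all hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse, parse_qs

# Hypothetical referrer URLs, as they might appear in the referrer field of a log.
referrers = [
    "https://www.google.com/search?q=web+usage+mining",
    "https://www.google.com/search?q=rough+set+theory",
    "https://www.bing.com/search?q=web+usage+mining",
    "https://example.org/links.html",
    "-",  # direct visit, no referrer recorded
]

def summarize_referrers(urls, query_param="q"):
    """Count referring hosts and search phrases.

    Assumption: the search phrase is stored in `query_param`; real search
    engines may use other parameter names or hide the phrase entirely.
    """
    sites, keywords = Counter(), Counter()
    for url in urls:
        if url in ("", "-"):
            continue  # no referrer available
        parsed = urlparse(url)
        if not parsed.netloc:
            continue  # not an absolute URL, skip it
        sites[parsed.netloc] += 1
        for phrase in parse_qs(parsed.query).get(query_param, []):
            keywords[phrase.lower()] += 1
    return sites, keywords

sites, keywords = summarize_referrers(referrers)
print(sites.most_common(3))
print(keywords.most_common(3))
```

Counting with `Counter` keeps the sketch short; a real log would first need the referrer field extracted from each record.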
2.1: Pre-processing of Web Logs:
The need for pre-processing is explained in section 1.3. Its advantages are that considerable storage space otherwise taken up by irrelevant records is saved and that the precision of the mining results can be improved. This chapter deals with the pre-processing of Web log files for mining user behavior, and hence all search engine crawler requests, unsuccessful requests and other irrelevant requests containing .jpg, .mpg, .gif, .png, .txt, .wav etc. are removed. The indiscernibility relation in rough set theory is used for pre-processing [234Jose12] [240Jose12]. Table 2.1 shows various status codes of Hyper Text Transfer Protocol [27indicating response status.
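The cleaning step described above can be sketched in Python as a simple record filter. This is only an illustration under stated assumptions, not the rough set based algorithm of the book: it assumes the log is in the Apache/NCSA Combined Log Format, treats only 2xx status codes as successful, and detects crawlers with a crude substring heuristic on the user agent. The sample lines and the names `LOG_PATTERN`, `CRAWLER_HINTS` and `is_relevant` are made up.

```python
import re

# Assumed input format: Apache/NCSA Combined Log Format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Extensions named in the text as irrelevant for mining user behavior.
IRRELEVANT_EXTENSIONS = (".jpg", ".mpg", ".gif", ".png", ".txt", ".wav")
# Assumed heuristic: crawlers usually identify themselves in the user agent.
CRAWLER_HINTS = ("bot", "crawler", "spider", "slurp")

def is_relevant(line):
    """Return True if the record is a successful, non-crawler page request."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return False  # malformed record
    path = m["url"].split("?", 1)[0].lower()
    if path.endswith(IRRELEVANT_EXTENSIONS):
        return False  # image, media or text resource
    if not m["status"].startswith("2"):
        return False  # assumption: only 2xx counts as successful
    agent = m["agent"].lower()
    return not any(hint in agent for hint in CRAWLER_HINTS)

sample = [
    '10.0.0.1 - - [10/Oct/2012:13:55:36 +0530] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (Windows NT 6.1)"',
    '10.0.0.1 - - [10/Oct/2012:13:55:37 +0530] "GET /logo.png HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 6.1)"',
    '10.0.0.2 - - [10/Oct/2012:13:56:01 +0530] "GET /missing.html HTTP/1.1" 404 209 "-" "Mozilla/5.0 (X11; Linux)"',
    '66.249.0.1 - - [10/Oct/2012:13:57:12 +0530] "GET /about.html HTTP/1.1" 200 1024 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
]

cleaned = [line for line in sample if is_relevant(line)]
print(len(cleaned), "of", len(sample), "records kept")
```

Of the four sample records, only the request for /index.html passes all three checks; the others are dropped as an image, a failed request and a crawler visit respectively.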
2.1.1: Indiscernibility Relations in Rough Set Theory:
A rough set based feature selection for Web Usage Mining is used in [94Inbarani07]. The experimental results show the importance of Web data pre-processing and that it reduces the size of the log file. Feature selection is a pre-processing step in data mining and is very effective in reducing dimensions. Feature selection refers to choosing a subset of attributes from the set of original attributes. Its purpose is to identify the significant features, eliminate the features that are irrelevant or dispensable to the learning task, and build a good learning model. The indiscernibility relation in rough set theory is used for clustering in [95Hirano05]. The main advantage of this method is that it can be applied to proximity measures that do not satisfy the triangle inequality and that it handles relative proximity very well. Relative proximity is a class of proximity measures that is suitable for representing subjective similarity or dissimilarity, such as the degree of likeness between people. Indiscernibility relations in rough set theory [96Pawalak02] can be used for the data cleaning of Web log files. Rough set theory is based on the assumption that some information is associated with every object of the universe of discourse. Objects characterized by the same information are indiscernible (similar) in view of the available information about them. Any set of all indiscernible (similar) objects is called an elementary set and forms a basic granule of knowledge about the universe. Any union of elementary sets is referred to as a crisp (precise) set; otherwise the set is rough (imprecise, vague).
Consider a pair S = (U, A) of non-empty finite sets U and A, where U is the universe of objects and A is the set of attributes. Each attribute a in A is a function a: U → Va, where Va is the set of values of attribute a, called the domain of a. The pair S = (U, A) is called an information system. Any information system can be represented by a data t
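The idea of elementary sets can be made concrete with a small Python sketch: objects that agree on every chosen attribute fall into the same equivalence class. The records, attribute names and the function `equivalence_classes` below are hypothetical and serve only to illustrate the grouping, not to reproduce the pre-processing algorithms of the book.

```python
from collections import defaultdict

def equivalence_classes(objects, attributes):
    """Group objects that are indiscernible with respect to `attributes`.

    `objects` maps an object name to a dict of attribute values. Two objects
    land in the same class exactly when they agree on every listed attribute.
    """
    classes = defaultdict(list)
    for name, values in objects.items():
        key = tuple(values[a] for a in attributes)
        classes[key].append(name)
    return dict(classes)

# Hypothetical log records described by three attributes.
records = {
    "r1": {"status": 200, "method": "GET", "type": "html"},
    "r2": {"status": 200, "method": "GET", "type": "html"},
    "r3": {"status": 404, "method": "GET", "type": "html"},
    "r4": {"status": 200, "method": "GET", "type": "png"},
}

classes = equivalence_classes(records, ["status", "type"])
for key, members in classes.items():
    print(key, members)
```

With the attributes status and type, r1 and r2 are indiscernible and form one elementary set, while r3 and r4 each form their own. Using only the attribute method would place all four records in a single class, which shows how the choice of attributes determines the granularity of knowledge.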

"About this title" may refer to a different edition of this title.