A model to implement text mining on social networking sites – Patel Pratik M

Technical Paper Title: A model to implement text mining on social networking sites

Author: Patel Pratik M, 2nd year BTech, IT

College: Shah and Anchor Kutchhi Engineering College, Chembur(E), Mumbai

Abstract:

This paper aims to demonstrate a system that establishes relationships from information mined from non-rule based unstructured data. For the purposes of this study, posts on social networking websites have been mined for such data. The paper puts forth a basic model for a text mining system in which relationships are established based on the keywords extracted from such posts. The extracted relationships and the ontology of a related domain form the base for a query analysis process.

Keywords: relationship extraction, text mining, non rule based unstructured data, query analysis, social network analysis

1. Introduction:

Text mining has emerged as a widely used means to derive knowledge from unstructured data, especially data available on the World Wide Web. The related issues are discussed in [5], which defines the concept of web mining and develops a system for it. Senellart and Blondel [4] describe ways to discover similar words from the WWW corpus. However, both these works are based on text that follows the standard syntax of English grammar. The information mined is therefore based on predetermined relationships using certain rules and ontologies. With the advent of social networking websites, however, a lot of information is in a non rule based textual format. The usefulness of this information for determining social behavior is demonstrated in [7]. In view of this development, it becomes essential to extract behavioral patterns from relationships established using these data sources rather than from predefined associations in ontologies. This paper proposes a mechanism to mine for social behavior through extraction of relationships from data available on social forums.

2. Related work:

In [2], the authors suggest a text mining system that obtains the relationships between the topics of international conferences. Their experiment suggests that the method works not only for obtaining the relationships between topics of conferences, but also for discovering the relationships between information entities that users are interested in.

Another paper [3] presents an unsupervised model for learning arbitrary relations between concepts of a molecular biology ontology for the purpose of supporting text mining and manual ontology building. Relations between named entities are learned from the GENIA corpus by means of several standard natural language processing techniques.

Both [4] and [5] have given insights on work done on the WWW corpus for text mining based on ontological systems.

3. Overview:

The paper covers the text mining system in detail. Section 4 discusses the ontology used for the system. Section 5 discusses the data source for raw unstructured data. Section 6 and its subsections discuss the implementation of the system along with a brief description of each module. In sections 7 and 8, the implications and extensibility issues are discussed. Section 9 discusses the scope for improvements in the work.

4. Ontology description:

An ontology defines a set of representational primitives used to model a domain of knowledge. The definition of ontology that has been widely accepted is given by Gruber in [1], “An ontology is an explicit specification of some topics. It is a formal and declarative representation, which includes the vocabulary (or names) for referring to the terms in a specific subject area and the logical statements that describe what terms are, how they are related to each other.” For developing the ontology, we used Protégé [10], a Java-based ontology editor and knowledge-base framework that provides a plug-and-play environment, making it a flexible base for rapid prototyping and application development.

Using Protégé, an ontology primarily based on movies has been created. It aims to provide a controlled vocabulary to semantically describe movie-related concepts such as Movie, Genre, Director and Actor, and it also demonstrates the relationships existing between these concepts. The movie ontology, however, has a limited scope at this stage of development. An ontology based on emotions and the language used on social networking sites is also under development, and the plan is to merge the movie ontology with this social behavior ontology to analyze reactions towards domain entities on social forums.

5. Data sources:

A large pool of highly unstructured, non rule based data is available on social networking websites. This data is retrieved using targeted searches and stored in datasets. It is then mined to create indices and form relationships between concepts.

6. Implementation:

In this section we will examine the process of building indices and establishing relationships which will later be useful during query analysis. We follow an unsupervised model for learning arbitrary relations between concepts of the ontology as advocated in [3].

Figure 1 Overview of the system

6.1 Overview of the system:

Figure 1 gives a brief overview of the system. User pages/posts are retrieved and stored in local data repositories. The stored documents are filtered to eliminate superfluous information. The retained pages are indexed based on the page ID and keywords extracted from the content of the pages.

The keywords are clustered and relationships are established between the clusters thus formed. The former step requires reference to the ontology described in section 4 while inferences from the latter would be reflected in the ontology. The query engine would process user queries. It has been depicted in Figure 2.

6.2 Collection of relevant data

Data for this purpose is retrieved using the API provided by the social networking site or using spiders as explained in [5], and is stored locally in large text repositories. The data exists in an unstructured format and contains both redundant and irrelevant information. It is therefore necessary to purge the data so that only relevant sources remain. This is done through the process of filtering explained in detail in [2].

6.3 Building indices based on keywords

This is a two-step process. The first step involves parsing the filtered documents for keywords based on an ontology of words pertaining to the domain. The next step involves building indices that map document IDs to particular keyword occurrences. Such an index is useful in the retrieval process in section 6.6.
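The two steps above can be sketched in Python as follows. This is only a minimal illustration, not the system's actual implementation: the vocabulary ONTOLOGY_TERMS, the sample documents and the function names are all assumptions made for the example, and keyword matching is reduced to exact, case-insensitive token lookup.

```python
from collections import defaultdict

# Hypothetical domain vocabulary; in the paper this comes from the ontology.
ONTOLOGY_TERMS = {"comedy", "movie", "enjoy", "cool", "great"}


def extract_keywords(text, vocabulary):
    """Step 1: return the vocabulary terms that occur in the text."""
    tokens = {token.strip(".,!?:;()\"'").lower() for token in text.split()}
    return tokens & vocabulary


def build_index(documents, vocabulary):
    """Step 2: map each keyword to the sorted IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for keyword in extract_keywords(text, vocabulary):
            index[keyword].add(doc_id)
    return {keyword: sorted(ids) for keyword, ids in index.items()}


# Made-up filtered documents, keyed by document ID.
documents = {
    1: "That movie was great!",
    2: "A cool comedy, really great.",
    3: "Nothing relevant here.",
}
index = build_index(documents, ONTOLOGY_TERMS)
```

Storing the index as keyword to document IDs (an inverted index) keeps the later lookup of a query keyword to a single dictionary access.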

6.4 Clustering

In this step, the keywords are grouped based on their closeness to related keywords, forming clusters of conceptually related terms. This can be done through several methods, some of which are described in [6].
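As a rough illustration of keyword clustering, the sketch below groups keywords that co-occur in at least a minimum number of documents. This is a deliberately simple stand-in (connected components over a co-occurrence graph), not the fuzzy clustering methods of [6]; the function names, the min_shared parameter and the sample keyword sets are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(keyword_sets):
    """Count how many documents each unordered keyword pair shares."""
    counts = Counter()
    for keywords in keyword_sets:
        for pair in combinations(sorted(keywords), 2):
            counts[pair] += 1
    return counts


def cluster_keywords(keyword_sets, min_shared=2):
    """Group keywords linked by at least `min_shared` shared documents."""
    parent = {}

    def find(word):
        # Union-find lookup with path halving.
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for keywords in keyword_sets:
        for word in keywords:
            find(word)  # register every keyword, even isolated ones
    for (first, second), count in cooccurrence_counts(keyword_sets).items():
        if count >= min_shared:
            parent[find(first)] = find(second)

    clusters = {}
    for word in parent:
        clusters.setdefault(find(word), set()).add(word)
    return sorted(sorted(cluster) for cluster in clusters.values())


# Made-up keyword sets, one per indexed document.
keyword_sets = [
    {"comedy", "fun", "great"},
    {"comedy", "fun"},
    {"thriller", "tense"},
    {"thriller", "tense", "great"},
]
clusters = cluster_keywords(keyword_sets)
```

With these inputs, only the pairs (comedy, fun) and (tense, thriller) are shared by two documents, so "great" remains in a cluster of its own.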

6.5 Relationship establishment

After the creation of conceptual clusters, the system combines them with one or more common topics. The commonality between topics is derived from their positions in the ontology by depicting relationships as dependency paths. The output consists of a set of templates, each involving a pair of ontology classes and a semantic relation [3].

6.6 Query Analysis:

The query analysis engine is modeled in the following manner:

Figure 2 Query Analysis Engine

Figure 2 portrays the functioning of the query engine.

The system analyzes a query by mapping the keywords to the ontology classes defined during clustering and tracing its dependency onto related concepts from the associations developed in section 6.5.

The pseudo code for the query analysis process is as follows:

TERMINOLOGY

K = SET OF KEYWORDS, K = {j}

x ∈ C, if x is an entity that is an instance of class C

S ⊂ C, if S is a subclass of C in the ontology

S:P = V, if P is the property of the subclass S and V is the set of values of P

ID(x, y), if a post consists of a join between keywords x and y

PSEUDOCODE

READ USER_QUERY

PARSE USER_QUERY

IF (WORD ∈ (SET OF ENTITIES IN ONTOLOGY))

K = K ∪ {WORD}

∀ x ∈ K, SET OF RELATED WORDS = {y ∈ x | y ⊂ C ∧ x ⊂ C | y: (y ∈ V ∧ x = P)}

LIST OF RESULTS = {ID(x, y)}
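One possible reading of the first steps of this pseudocode is sketched below in Python. It is an interpretation under stated assumptions, not the authors' implementation: the toy ontology tables SUBCLASS_OF and PROPERTY_VALUES and their contents are hypothetical, a word y is treated as related to x either when both fall under the same class C or when y is a value of the property x, and x itself is excluded from its own related set. The final ID(x, y) lookup is left out because it depends on the index of posts.

```python
# Hypothetical toy ontology; all names below are illustrative only.
SUBCLASS_OF = {  # S ⊂ C: keyword or subclass -> parent class
    "enjoy": "happiness",
    "cool": "happiness",
    "great": "happiness",
    "sad": "sorrow",
}
PROPERTY_VALUES = {  # S:P = V: property -> set of values
    "genre": {"comedy", "drama"},
}
ENTITIES = (
    set(SUBCLASS_OF)
    | set(PROPERTY_VALUES)
    | {value for values in PROPERTY_VALUES.values() for value in values}
)


def parse_query(query):
    """PARSE USER_QUERY: lowercase tokens with surrounding punctuation removed."""
    return [word.strip("?.,!\"'").lower() for word in query.split()]


def collect_keywords(query):
    """Build K by adding every query word that is an entity in the ontology."""
    keywords = set()
    for word in parse_query(query):
        if word in ENTITIES:
            keywords |= {word}  # K = K ∪ {WORD}
    return keywords


def related_words(x):
    """Words sharing a parent class with x, plus the values of property x."""
    parent = SUBCLASS_OF.get(x)
    siblings = {
        y for y, cls in SUBCLASS_OF.items()
        if parent is not None and cls == parent and y != x
    }
    return siblings | PROPERTY_VALUES.get(x, set())


K = collect_keywords("Which genre is cool?")
related = {x: related_words(x) for x in K}
```

For this made-up query, K contains "genre" and "cool"; "cool" expands to the other words under happiness and "genre" expands to its values.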

To demonstrate the analysis of a simple query, we shall consider an example.

QUERY: “Which comedy movies are enjoyed the most?”

The processing of this query is described in brief in stages 6.6.1 to 6.6.5.

6.6.1 Keyword Extraction

Keyword extraction is the process in which the query is scanned for words that are infrequent and give meaning to the entire query. This involves skipping frequently used words and conjunctions such as of, the, on, etc., and locating proper nouns and words that are included in the domain ontology. This is achieved by parsing. For a non-rule based data source, soft parsing or shallow parsing is used.

For the query considered above, the keywords extracted are comedy, movies, enjoyed. Of these, the keyword movies is extracted by the soft parsing technique because it matches movie, which already exists in the ontology. A similar extraction strategy is used for enjoyed. Hence the keywords that are used for the next stage are comedy, movie, enjoy.
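The extraction step for this query can be approximated as shown below. This is a minimal sketch assuming a hand-written stop-word list and a crude suffix-stripping rule in place of real soft parsing; STOP_WORDS, ONTOLOGY_TERMS and candidate_stems are hypothetical names, and the location of proper nouns is not covered.

```python
# Hypothetical stop-word list and ontology vocabulary for the example query.
STOP_WORDS = {"which", "are", "the", "most", "of", "on", "is", "a", "an", "and"}
ONTOLOGY_TERMS = {"comedy", "movie", "enjoy"}


def candidate_stems(word):
    """Yield the word followed by a few crude suffix-stripped variants."""
    yield word
    if word.endswith("ies"):
        yield word[:-3] + "y"  # comedies -> comedy
    if word.endswith("ed"):
        yield word[:-2]  # enjoyed -> enjoy
    if word.endswith("s"):
        yield word[:-1]  # movies -> movie


def extract_keywords(query):
    """Skip stop words and map each remaining word to an ontology term."""
    keywords = []
    for token in query.split():
        word = token.strip("?.,!\"'“”").lower()
        if not word or word in STOP_WORDS:
            continue
        for stem in candidate_stems(word):
            if stem in ONTOLOGY_TERMS:
                keywords.append(stem)
                break
    return keywords


keywords = extract_keywords("Which comedy movies are enjoyed the most?")
```

Matching against the ontology after stripping suffixes means an inflected form such as comedies would also resolve to comedy; words with no match in the ontology are simply dropped.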

6.6.2 Ontology Search

The extracted keywords are mapped to an extensive ontology that captures the relationships between concepts related to the domain. Search capabilities based on ontologies are demonstrated in [8]. A similar mapping is performed here by looking up an index built from the results of preprocessing empirical data. This stage retrieves <keyword: Post ID NO> pairs from the index for further processing.

The keywords extracted in the above step will be mapped to the ontology as follows.

comedy is a value of the genre property of the class movie. The entities having genre: comedy are instances of movies under the comedy category.

movie is a generic class defining several entities.

enjoy belongs to the subclass happiness under the class sentiment. The other keywords under the subclass happiness are cool, great, fun, smiling, excited, glad, joy, love etc.

For each keyword that satisfies the conditions above, suitable <keyword: Post ID NO> pairs are generated.

6.6.3 Relationship mapping

The query engine establishes relationships between concepts similar to the keyword entity to give an extended meaning to the search. This is done by generating joins between the keywords and checking for existence of posts containing the keywords of the join. The results thus generated carry pages not just based on the keywords of the query but concepts surrounding them as well. A detailed process in which such a system generates ontology classes-semantic relations pairs is described for a molecular biology ontology in [3].

In the case of our example, let us consider a pool of the following user posts/pages.

Post ID NO. 23: I loved DCH. Won’t mind watching it again. 🙂

Post ID NO. 31: Couldn’t stop smiling while watching Lage Raho

Post ID NO. 46: Wake up SID is out. Its really cool.

In all the above cases, the system looks up words related to the keyword enjoy within user posts containing instances of movie having value of the genre property as comedy such as DCH, Lage Raho, Wake Up Sid.
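The join in this example can be sketched as follows. The code assumes a small hand-built fragment of the ontology (COMEDY_MOVIES and HAPPINESS_WORDS) and uses plain substring matching, which is far cruder than the soft parsing described earlier; post 99 is a made-up negative example added only for illustration.

```python
# Hypothetical ontology fragment restricted to the paper's example.
COMEDY_MOVIES = {"dch", "lage raho", "wake up sid"}  # instances with genre: comedy
HAPPINESS_WORDS = {
    "enjoy", "cool", "great", "fun", "smiling", "excited", "glad", "joy", "love",
}

POSTS = {
    23: "I loved DCH. Won’t mind watching it again.",
    31: "Couldn’t stop smiling while watching Lage Raho",
    46: "Wake up SID is out. Its really cool.",
    99: "Stuck in traffic again.",  # made-up post that should not match
}


def mentions_any(text, terms):
    """True if any term occurs in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


def join_posts(posts, left_terms, right_terms):
    """Return IDs of posts that mention a term from each of the two sets."""
    return sorted(
        post_id
        for post_id, text in posts.items()
        if mentions_any(text, left_terms) and mentions_any(text, right_terms)
    )


matches = join_posts(POSTS, COMEDY_MOVIES, HAPPINESS_WORDS)
```

Substring matching lets "love" match "loved" in post 23, but it would also match unrelated words that merely contain a term, which is one reason the filtering stage is needed.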

6.6.4 Filtering

The search results also contain ambiguous or redundant pages that are not relevant to the keyword. These results are filtered out on the basis of their conceptual closeness to the keywords keeping only a fixed number of pages to be displayed as search results.

In our example, the pages filtered out would fall into one of the following categories:

1) Results that do not satisfy joins: RDB is a great movie.

Although great is similar to the keyword enjoy, RDB does not have its genre property set as comedy.

2) Similarity between keywords is below the empirical value or is 0.

The similarity between two keywords under the subclass happiness may be below the empirical value that has been set or updated. An example could be (enjoy, vigor), whose similarity, when calculated by Zamir and Etzioni’s method as described in [11], will yield a value lower than the empirical value. Posts containing vigor as a substitute for enjoy will therefore be eliminated.
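The threshold-based part of this filtering can be illustrated with the sketch below. The similarity scores are invented placeholders and are not computed with the method of [11]; the function name, the threshold of 0.5, the result limit and post IDs 60 and 61 are likewise assumptions made for the example. Ordering the surviving results by similarity is one possible choice for keeping a fixed number of pages.

```python
def filter_results(candidates, threshold, max_results):
    """Keep candidates whose similarity is non-zero and at least the threshold.

    `candidates` is a list of (post_id, similarity) pairs. The survivors are
    ordered by decreasing similarity (ties broken by post ID) and truncated
    to `max_results` entries.
    """
    kept = [
        (post_id, similarity)
        for post_id, similarity in candidates
        if similarity > 0 and similarity >= threshold
    ]
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept[:max_results]


# Placeholder scores between each post's sentiment word and the keyword enjoy.
candidates = [(23, 0.9), (31, 0.7), (46, 0.8), (60, 0.2), (61, 0.0)]
filtered = filter_results(candidates, threshold=0.5, max_results=2)
```

Here the posts scored 0.2 and 0.0 fall below the threshold, and the limit of two results then drops post 31 even though it passed the threshold.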

6.6.5 Ranking

The search results are ranked on the basis of their relevance and displayed accordingly. Appropriate mathematical models that rank domain terms while detecting changes are presented in [9]. This is particularly essential when the mined data comes from a frequently updated source such as the one described in section 5.

6.7 User Interface

The results of the query are displayed in a user interface that also consists of related keywords and concepts.

7. Implications:

In such a system, methods such as clustering of like concepts are used to form associations. Thus, complex relationships can be established within the dataset that go beyond what can be interpreted from an ontology alone. A model that can successfully extract relationships from a pool of non rule based data can find applications in a variety of fields such as semantic search, sentiment analysis, life analytics and social network analysis.

8. Extensibility:

Although the Hindi movie domain has been selected, it is possible to extend the project to various other relevant topics by building the respective ontologies. An ontology providing a higher level of detail can also be developed, which could further be used to encompass a larger domain of topics of interest. Different algorithms can be utilized to improve performance depending on the parameters that are of utmost importance. For example, in the case of a social networking site where posts are regularly updated, the speed of the algorithm is given higher preference than its accuracy. The system can be implemented across various social networking sites and blogs to map social behavior and analyze current trends.

9. Discussions:

The data is mined from a frequently updated source, i.e. posts on a social networking site. Therefore a system that updates associations between entities based on these additions is desirable. The extraction of relationships based on new content can modify earlier dependencies, yielding better results. By using a query analysis engine that maps relations onto an ontology, the system has the potential to display better semantic search capabilities.

On the other hand, semantically, the relationships between keywords depend largely on the contexts in which they are used. Hence the results displayed could, in certain cases, be inaccurate, as entities can be related through multiple contexts and links. Moreover, the language used on social networking sites is, to a large extent, non rule based, i.e. it does not follow the syntactical rules of the English language. This requires the development of a highly branched and detailed ontology, which can be quite difficult. Additionally, for such a complex system, it is difficult to predict the performance under various load conditions, and this requires considerable study once the system is implemented.

References:

[1] T. Gruber, Ontology Definition, http://www-ksl.stanford.edu/kst/what-is-an-ontology.html

[2] Mine, T., Lu, S., and Amamiya, M. 2002. Discovering Relationships between Topics of Conferences by Filtering, Extracting and Clustering. In Proceedings of the 13th international Workshop on Database and Expert Systems Applications (September 02 – 06, 2002). DEXA. IEEE Computer Society, Washington, DC, 205-209.

[3] Ciaramita, M., Gangemi, A., et al. 2008. Unsupervised Learning of Semantic Relations for Molecular Biology Ontologies. In Proceedings of the 2008 Conference on Ontology Learning and Population: Bridging the Gap between Text and Knowledge, P. Buitelaar and P. Cimiano, Eds. Frontiers in Artificial Intelligence and Applications, vol. 167. IOS Press, Amsterdam, The Netherlands, 91-104.

[4] Senellart P. and Blondel V., Automatic Discovery of Similar Words, Survey of Text Mining: Clustering, Classification, and Retrieval, Springer (Ed.) (2008), Page 25

[5] Castellano M., Mastronardi G., Aprile A., and Tarricone G., A Web Text Mining Flexible Architecture, International Journal of Computer Science and Engineering, Volume 1 Number 4, Summer 2007, Page 252

[6] Tsekouras G., Dimitris P., et al., Fuzzy Clustering of Categorical Attributes and its Use in Analyzing Cultural Data, International Conference on Computational Intelligence, ICCI 2004, December 17-19, 2004, Istanbul, Turkey, Proceedings 2004

[7] Java, A., Song, X., Finin, T., and Tseng, B. 2007. Why we twitter: understanding microblogging usage and communities. In Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis (San Jose, California, August 12, 2007).

[8] Holger Bast, Fabian Suchanek, Ingmar Weber, “Semantic Full-Text Search with ESTER: Scalable, Easy, Fast,” Data Mining Workshops, International Conference on, pp. 959-962, 2008 IEEE International Conference on Data Mining Workshops, 2008.

[9] Enkhsaikhan, M., Wong, W., Liu, W., and Reynolds, M. 2007. Measuring data-driven ontology changes using text mining. In Proceedings of the Sixth Australasian Conference on Data Mining and Analytics – Volume 70 (Gold Coast, Australia, December 03 – 04, 2007).

[10] Protégé home page: http://protege.stanford.edu/

[11] Oren Zamir and Oren Etzioni. Web document clustering: A feasibility demonstration. In Proceedings of the 21st Intl. ACM SIGIR Conference, pages 46–54, 1998.