ScienceDirect

Available online at www.sciencedirect.com

Procedia Computer Science 113 (2017) 280–285

1877-0509 © 2017 The Authors. Published by Elsevier B.V.

Peer-review under responsibility of the Conference Program Chairs.

10.1016/j.procs.2017.08.299

The 8th International Conference on Emerging Ubiquitous Systems and Pervasive Networks (EUSPN 2017)

Analyzing Social Media through Big Data using InfoSphere BigInsights and Apache Flume

Marouane Birjali a,*, Abderrahim Beni-Hssane a, Mohammed Erritali b

a LAROSERI Laboratory, Department of Computer Sciences, University of Chouaib Doukkali, Faculty of Sciences, El Jadida, Morocco

b TIAD Laboratory, University of Sultan Moulay Slimane, Faculty of Sciences and Technologies, Béni Mellal, Morocco

Abstract

Social media gives organizations the ability to survey, in real time, the feelings expressed about the content and events associated with them. The first step of sentiment analysis is the pre-processing of data collected from social media, and most existing research on social media analysis focuses on extracting new sentiment-related features. This paper presents the use of Twitter data on a number of proposed subjects; Twitter is one of the largest social networking websites, and its data grows at increasing rates every day, which makes it a Big Data source. We then describe in detail how a Big Data technology such as InfoSphere BigInsights enables the processing of this data, which is collected from social networks by Apache Flume and stored in Hadoop storage. In addition, we investigate a Big Data platform that collects social media data with Apache Flume and analyzes it with InfoSphere BigInsights, and we integrate the visualization of the analysis results using BigSheets. Evaluation through analysis of the results indicates that the proposed Big Data platform produces better results in terms of social media analysis.

Keywords: Big Data; Hadoop; InfoSphere BigInsights; Social Media Analysis; BigSheets; Twitter Data; Apache Flume.

* Corresponding author. E-mail address: birjali.marouane@gmail.com

1. Introduction

Today, companies face growing challenges from a commercial perspective. In particular, their added value must be produced from the huge amount of data generated, and must cope with the complexity of data that can be structured,

semi-structured, or unstructured. In recent years, the Internet has gained an even wider scope through the development of social media. Built on communication techniques and accessible to all, social media promote social interaction over the Internet. Many social networks exist, and more than 900 social media sites are available on the Internet [1]. Millions of people use Twitter, and it is ranked among the most visited sites, with an average of 58 million tweets per day [2]. Big Data marks the limit of an enterprise's ability to store, process, and access all the data it needs to function effectively, make decisions, reduce risks, and serve its customers within a reasonable time [3].

In addition, the first organizations to adopt Big Data were online and startup companies. Big Data can reduce costs and substantially improve the time needed to perform a computing task [4]. According to industry statistics, roughly 2.5 million items are reportedly added per minute by individual users, along with 300,000 tweets, 220,000 images, and 200 million emails generated per minute; some companies produce on the order of 5 TB of RFID data, and gas turbines generate more than 1 PB of data per day. All of this data amounted to 2.8 zettabytes as of 2015, and by 2020 the figure may reach 40 zettabytes [5,6,7]. About 90% of this data is unstructured and is becoming difficult for businesses to process. Another approach is therefore needed to extract value from such large data [8].

This massive data is considered Big Data and can be used for industrial or business purposes once it has been organized and processed as needed. This work presents how to analyze data from social networks using BigInsights and BigSheets for processing, analysis, and visualization [9,10], on top of distributed Apache Hadoop [11,12], with Apache Flume [13] for collecting the data. Among the many Big Data technologies, Hadoop is popular and widely used to meet the challenges of Big Data [14]. Many platforms provide Hadoop distributions, such as Apache Hadoop [12], IBM BigInsights [9,10], Microsoft Azure HDInsight [15], Cloudera [16], and Hortonworks [17]. These tools perform data analysis and processing functions suited to different problem areas [14].

This paper is organized as follows. Section 2 reviews related work. Section 3 presents the background of our work. Section 4 describes the problem statement, the existing tools, and our methodology. Finally, Section 5 briefly summarizes the paper.

2. Related works

In this section, we describe related research on social media analysis. Social networks have attracted the attention of researchers, who try to analyze, among other things, private life, interconnection, and interaction between users. People tend to express their feelings and talk about their daily activities through Twitter.

There is a large body of research on sentiment analysis, covering rule-based techniques, bag-of-words, and machine-learning methods. The two main research directions in opinion mining operate at either the document level [18,19,20] or the sentence level [21,22,23]. Most classification methods at the document and sentence levels rely on identifying opinion terms or phrases, and there are basically two types of methods for this: (1) lexicon-based methods and (2) rule-based methods. Sentiment analysis is treated as a natural language processing task at several levels of granularity [24]. It began as a document-level classification task with Turney [25], was then handled at the sentence level by Hu and Liu [26], and more recently at the phrase level by Wilson [27].

A social network such as Twitter, on which users post their reactions to and opinions about "everything", poses a new and different challenge. Some early results and recent analyses of Twitter sentiment data have been reported. Opinion mining research again operates mainly at the document level [28] or at the sentence level, and classification methods at both levels generally rely on identifying opinion words or phrases.

In this paper, however, we focus on social media data, where users post real-time reactions and activities about "everything"; because this data grows at high rates every day, it is considered Big Data [14]. The data is processed and analyzed using InfoSphere BigInsights [9,10], which brings additional performance to Hadoop [11,12]. This also includes visualizing the analysis results of large data tables using BigSheets sheets and workbooks.

3. Background

In this section, we present the background and a detailed description of our research work. A data set is created from social media posts about electronic products. We then perform sentence-level sentiment analysis in three phases. The first phase is preprocessing. Next, a feature vector is created using relevant features. Finally, we use different functions of BigSheets to classify the posts into positive and negative classes.
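As an illustration, the three phases can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the BigSheets workflow itself: the two word lists are hypothetical, the feature vector is reduced to counts of lexicon hits, and the tie-breaking rule is an arbitrary choice made for the example.

```python
import re

# Hypothetical opinion lexicons, for illustration only; the paper does not list its word sets.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}


def preprocess(post: str) -> list[str]:
    """Phase 1: lowercase, drop URLs and @mentions, keep alphabetic tokens."""
    text = re.sub(r"https?://\S+|@\w+", " ", post.lower())
    return re.findall(r"[a-z']+", text)


def feature_vector(tokens: list[str]) -> dict[str, int]:
    """Phase 2: count positive and negative lexicon hits."""
    return {
        "pos": sum(t in POSITIVE for t in tokens),
        "neg": sum(t in NEGATIVE for t in tokens),
    }


def classify(features: dict[str, int]) -> str:
    """Phase 3: label by majority of lexicon hits; ties count as negative here."""
    return "positive" if features["pos"] > features["neg"] else "negative"


def analyze(post: str) -> str:
    """Run the three phases on a single post."""
    return classify(feature_vector(preprocess(post)))
```

Calling `analyze` on a post such as "I love this phone, great battery!" walks through all three phases and returns a class label. A real feature vector would typically include more than two counts.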

3.1. Hadoop framework

Apache Hadoop is a highly available distributed framework that offers a distributed storage system, HDFS (Hadoop Distributed File System), and a processing management system [11,12]. Hadoop stores data with replication, so that copies of the data are kept on several nodes. As a result, Hadoop does not need to be configured with a RAID system, which becomes redundant. Hadoop also offers a processing framework for large data volumes called MapReduce [29].

The MapReduce architecture is composed of two phases: the map phase and the reduce phase. In the map phase, the input data is divided into several splits and emitted as <key, value> pairs where, in a word-count job, the key is the word and the value indicates how many times the word has occurred; each sub-task is assigned to a task tracker. Finally, in the reduce phase, the results of the task trackers are combined to produce the final results [30].
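The word-count example can be sketched on a single machine to show what each phase contributes. The code below is an in-memory illustration of the map, shuffle, and reduce steps, not Hadoop code; the function names are ours.

```python
from collections import defaultdict
from itertools import chain


def map_phase(split: str) -> list[tuple[str, int]]:
    """Map: emit a <word, 1> pair for every word in one input split."""
    return [(word, 1) for word in split.lower().split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups) -> dict[str, int]:
    """Reduce: sum the values of each key to get the final counts."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(splits: list[str]) -> dict[str, int]:
    """Run map on every split, then shuffle and reduce the emitted pairs."""
    mapped = chain.from_iterable(map_phase(s) for s in splits)
    return reduce_phase(shuffle(mapped))
```

In a cluster, each split would be mapped on a different node and the shuffle would move pairs across the network; here everything runs in one process so that the data flow is easy to follow.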

HDFS is highly fault tolerant and designed for low-cost hardware; it holds very large amounts of data stored across multiple machines (multi-node). Important characteristics of HDFS include storage and processing in a distributed environment and streaming access to data files. For data security, Hadoop provides file permissions and user authentication.

3.2. Apache Flume

Flume was originally developed by Cloudera before being donated to the Apache community [13]. It is now called Flume NG (Next Generation). Flume works as a distributed service for real-time data collection, temporary storage, and delivery to a target [31]. Flume is a highly reliable, distributed, and configurable tool. It is designed to collect streaming data from several web servers and deliver it to HDFS. Technically, a Flume agent creates routes that connect a source to a target (sink) via a Flume channel, as shown in Fig. 1.

The source: Flume retrieves messages from different sources, especially log files but also, as we will see, Twitter data.

The Flume channel: a buffer that stores messages before they are consumed. Memory storage is generally used.

The Flume target (sink): consumes the messages coming from the channel in batches and writes them to a destination such as HDFS.

Fig. 1. Apache Flume architecture.
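The source, channel, and sink roles described above can be mimicked with a small toy model. The sketch below is not Flume's API: `MemoryChannel` and `run_agent` are hypothetical names, and the rule of draining one batch when the channel is full is a simplification chosen for the example.

```python
from collections import deque


class MemoryChannel:
    """Bounded in-memory buffer between a source and a sink."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.events = deque()

    def put(self, event: str) -> bool:
        if len(self.events) >= self.capacity:
            return False  # channel full: the source has to retry or drop
        self.events.append(event)
        return True

    def take(self, max_events: int) -> list[str]:
        batch = []
        while self.events and len(batch) < max_events:
            batch.append(self.events.popleft())
        return batch


def run_agent(messages, channel: MemoryChannel, batch_size: int) -> list[list[str]]:
    """Source puts messages on the channel; sink drains it in batches."""
    written = []  # stands in for files written to a destination such as HDFS
    for message in messages:
        if not channel.put(message):
            written.append(channel.take(batch_size))  # make room, then retry
            channel.put(message)
    while channel.events:
        written.append(channel.take(batch_size))
    return written
```

The model shows why the channel matters: it decouples the rate at which the source produces messages from the rate at which the sink writes batches.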

After creating an application on the official Twitter website, we use its consumer key and consumer secret together with the access token and access token secret. With these credentials, we can access Twitter and collect the tweets we are interested in. Fig. 2 shows the configuration file we used to collect tweets from Twitter.

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = aaaaaaaaaaaaaaaaaaaaaaaaaa
TwitterAgent.sources.Twitter.consumerSecret = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
TwitterAgent.sources.Twitter.accessToken = aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
TwitterAgent.sources.Twitter.accessTokenSecret = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

Fig. 2. Flume configuration files for Twitter data.
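A configuration of this kind is easy to get wrong by mistyping a channel name, so a quick sanity check before starting the agent can help. The Python sketch below is a hypothetical helper, not part of Flume: it parses the properties text and verifies that every source and sink refers to a declared channel. It also compares the sink's hdfs.batchSize with the channel's transactionCapacity; treating a larger batch size as a problem is our assumption about good practice and should be checked against the Flume documentation for the version in use.

```python
def parse_properties(text: str) -> dict[str, str]:
    """Parse 'key = value' lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_agent(props: dict[str, str], agent: str) -> list[str]:
    """Return a list of problems found in the wiring of one agent."""
    problems = []
    channels = set(props.get(f"{agent}.channels", "").split())
    for source in props.get(f"{agent}.sources", "").split():
        used = props.get(f"{agent}.sources.{source}.channels", "").split()
        problems += [f"source {source}: unknown channel {c}" for c in used if c not in channels]
    for sink in props.get(f"{agent}.sinks", "").split():
        channel = props.get(f"{agent}.sinks.{sink}.channel", "")
        if channel not in channels:
            problems.append(f"sink {sink}: unknown channel {channel!r}")
            continue
        # Assumed rule of thumb: a sink batch should fit in one channel transaction.
        batch = int(props.get(f"{agent}.sinks.{sink}.hdfs.batchSize", "0"))
        cap = int(props.get(f"{agent}.channels.{channel}.transactionCapacity", "0"))
        if cap and batch > cap:
            problems.append(f"sink {sink}: batchSize {batch} > transactionCapacity {cap}")
    return problems
```

Applied to the values in Fig. 2, the wiring checks pass and the second check reports that the batch size (1000) is larger than the transaction capacity (100), which may be worth reviewing.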

3.3. InfoSphere BigInsights

IBM InfoSphere BigInsights is a Hadoop platform that offers new ways to use large volumes of data. In this paper, we describe the most frequently used features of InfoSphere BigInsights, which allow organizations to analyze large volumes of data from different sources and in different formats, in order to gather information they might not have had before [9].

InfoSphere BigInsights provides the capabilities organizations need to meet their business challenges while ensuring maximum compatibility with Hadoop. It includes a large number of IBM technologies that increase the performance of the open source Hadoop software to accelerate return on investment. InfoSphere BigInsights offers a wide range of capabilities that go beyond those of Hadoop, and IBM has chosen an inclusive approach [9,10]. This allows us to quickly start analyzing the data collected from social media in a Big Data environment. InfoSphere BigInsights has been the subject of several improvements:

Accelerating deployments with innovations from the Hadoop community

Using existing SQL skills and solutions

Enabling user-oriented analysis and data provisioning

4. Problem Statement and Methodology

4.1. Existing tools

As mentioned above, a small set of social media data can be downloaded and easily handled with traditional databases, using conventional techniques to stream and analyze the raw data. However, the data can be huge in quantity and consist of unstructured raw data that traditional databases cannot handle, process, and analyze. This is a broad problem in distributing and processing large volumes of data from Big Data sources in real time. A first step toward addressing it is to create a dashboard that monitors the flow of sentiment on Twitter.

This article presents how to overcome the limitations of traditional techniques by using the Hadoop ecosystem to streamline data processing on large clusters. The tweet data is collected by Flume, processed with a Jaql script, and stored in HDFS; positive and negative words are then analyzed using MapReduce methods, which return a result per tweet ID; finally, the results are displayed as charts using the BigSheets tool of BigInsights [32].

4.2. Methodology

In this paper, we address the question of analyzing large data from social networks using Hadoop to simplify data processing. We deployed the Apache Hadoop platform and IBM BigInsights to ease the analysis of Big Data problems. From this perspective, the analysis of tweets goes through the following steps, which overcome the problems of the traditional system. The collected data is in an unstructured format. A JAQL script [33] is used to extract the important fields and transform them into a simpler structure, a comma-delimited file, which is then stored in HDFS for processing with MapReduce [30]. For graphical visualization and final data manipulation, we use BigSheets. BigSheets is a spreadsheet-style tool provided with InfoSphere BigInsights that offers standard spreadsheet functions and lets users filter data, join tables, sort data, and visualize data in charts [32]. Figure 3 shows the percentage of coverage by language. To obtain this result, we simply group the tweet data by language and compute the total count of tweets in each group.

Fig. 3. Coverage by language in a pie chart.
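The extraction and grouping steps can be illustrated outside BigInsights with a short Python sketch. It is not the JAQL script used in this work: the selected fields are an assumption based on the public Twitter JSON layout, and the input is taken to be one JSON tweet per line.

```python
import csv
import io
import json
from collections import Counter

# Field names follow the public Twitter JSON layout; which fields the
# paper's JAQL script kept is an assumption made for this example.
FIELDS = ["id", "lang", "created_at", "text"]


def to_csv(json_lines: str) -> str:
    """Keep a few fields of each tweet and write them as comma-delimited rows."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in json_lines.splitlines():
        if line.strip():
            tweet = json.loads(line)
            writer.writerow([tweet.get(f, "") for f in FIELDS])
    return out.getvalue()


def coverage_by_language(csv_text: str) -> dict[str, float]:
    """Group rows by language and return each group's share in percent."""
    counts = Counter(row[1] for row in csv.reader(io.StringIO(csv_text)))
    total = sum(counts.values())
    return {lang: 100.0 * n / total for lang, n in counts.items()}
```

The percentages returned by `coverage_by_language` are the kind of values a pie chart of coverage by language would display.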

Figure 4 presents the number of tweets over time from the BigSheets Twitter analysis, and Figure 5 depicts a tag cloud chart that we generated for the words. As with any BigSheets tag cloud, a larger font indicates more occurrences of the data value, and scrolling over a data value reveals the number of times it occurred in the collection.

Fig. 4. Visualization by BigSheets Twitter Analysis for number of tweets over time.

Fig. 5. Visualization by BigSheets Twitter Analysis for Tag cloud.
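The aggregations behind Figures 4 and 5 are simple counts, and a minimal Python sketch of them is given below. BigSheets computes these charts interactively, so this is only an approximation of the underlying logic. The timestamp format is assumed to match Twitter's `created_at` field, and the stop-word list, bucket size and function names are illustrative choices, not details from the paper.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout of Twitter's "created_at" field,
# e.g. "Wed Aug 27 13:08:45 +0000 2008" (parsing needs an English locale).
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Small illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "rt"}


def tweets_per_hour(timestamps):
    """Count tweets in each hourly bucket, keyed as 'YYYY-MM-DD HH:00'."""
    buckets = Counter()
    for raw in timestamps:
        moment = datetime.strptime(raw, TWITTER_TIME_FORMAT)
        buckets[moment.strftime("%Y-%m-%d %H:00")] += 1
    return dict(sorted(buckets.items()))


def word_frequencies(texts, top_n=10):
    """Return the top_n most frequent words, the weights behind a tag cloud."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOP_WORDS and len(word) > 1:
                counts[word] += 1
    return counts.most_common(top_n)


if __name__ == "__main__":
    print(tweets_per_hour(["Wed Aug 27 13:08:45 +0000 2008"]))
    print(word_frequencies(["Big data and big insights", "big data in the cloud"], top_n=2))
```

In a tag cloud, each word's font size would be scaled from the count returned by `word_frequencies`. The tokenizer here is deliberately crude: it keeps only ASCII letters and apostrophes, so URLs, hashtags and non-English text would need separate handling.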


5. Conclusion

In this work, we presented a way of collecting social media data using Apache Flume and of analyzing and visualizing the Twitter data using InfoSphere BigInsights. This platform is applicable not only to streaming, processing, analyzing, and visualizing Twitter data but can also be extended to other types of Big Data from various sources. This paper shows that the proposed work improves the processing time for the analysis of massive Twitter data when compared to other traditional processing methods for Big Data.

References

1. Statistic Brain. Twitter Statistics, Retrieved from http://www.statisticbrain.com/twitter-statistics/, 2014.

2. R. Li, K. H. Lei, R. Khadiwala, Chang. TEDAS: A Twitter-based Event Detection and Analysis System. ICDE, pp. 1273–1276, 2012 IEEE 28th International Conference on Data Engineering, 2012.

3. Peter Lake, Paul Crowther. Concise Guide to Databases: A Practical Introduction. Springer-Verlag London 2013.

4. Thomas H. Davenport. Big Data at Work: Dispelling the Myths, Uncovering the Opportunities. Harvard Business Review Press.

5. D. Terrana, A. Augello, and G. Pilato. Automatic unsupervised polarity detection on a Twitter data stream. In Proc. IEEE Int. Conf. Semantic Comput., Newport Beach, CA, USA, Sep. 2014, pp. 128–134.

6. http://en.wikipedia.org/wiki/Twitter

7. http://blog.Twitter.com/2014/the-2014-yearontwitter

8. O'Reilly Radar Team, Planning for Big data, A CIO's Handbook to changing the Data Landscape.

9. Miloš Popović, Milan Milosavljević, Pavle Dakić. Twitter data analytics in education using IBM InfoSphere BigInsights. The Internet and Development Perspectives, International Scientific Conference on ICT and E-Business Related Research, Sinteza 2016.

10. https://www.ibm.com/support/knowledgecenter/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.product.doc/doc/bi_qse.html

11. K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. The 26th IEEE Symposium on Mass Storage Systems and Technologies, pp. 1–10, May 2010.

12. Apache Hadoop. https://hadoop.apache.org/.

13. Deepak Vohra. Practical Hadoop Ecosystem. Chapter Apache Flume, pp. 287–300, September 2016.

14. Rodríguez M., L., Rodríguez E., CA., Sánchez C., J.L. et al. J Supercomput (2016) 72: 3073. doi:10.1007/s11227-015-1501-1

15. Marshall C., Julian S., Anthony P., Mike M., David G., "Overview of Microsoft Azure Services", Microsoft Azure, Part 1, 2015.

16. Cloudera. http://www.cloudera.com/.

17. Hortonworks. http://hortonworks.com/.

18. B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 79–86, 2002.

19. P. Turney. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. ACL'02, 2002.

20. K. Dave, S. Lawrence, and D. Pennock. Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews. 2003.

21. M. Gamon, A. Aue, S. Corston-Oliver, and E. K. Ringger. Pulse: Mining customer opinions from free text. IDA'2005.

22. M. Hu and B. Liu. Mining and summarizing customer reviews. KDD'04, 2004.

23. S. Kim and E. Hovy. Determining the Sentiment of Opinions. COLING'04, 2004.

24. G. Vinodhini and R. Chandrasekaran. Sentiment analysis and opinion mining: A survey. International Journal, vol. 2, no. 6, 2012.

25. P. Turney. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. Proceedings of the Association for Computational Linguistics.

26. Hu M. and Liu B. Mining and Summarizing Customer Reviews. KDD '04 Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 168–177.

27. Wilson T., Wiebe J. and Hoffmann P. Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. In the Advanced Research and Development Activity (ARDA).

28. Wu, Yuanbin, Qi Zhang, Xuanjing Huang, and Lide Wu. Structural opinion mining for graph-based sentiment representation. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP-2011). 2011.

29. Ilkyu Ha, Bonghyun Back, and Byoungchul Ahn. MapReduce Functions to Analyze Sentiment Information from Social Big Data. Hindawi Publishing Corporation, International Journal of Distributed Sensor Networks, Volume 2015, Article ID 417502, 11 pages. http://dx.doi.org/10.1155/2015/417502

30. Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Google, Inc.

31. P.B. Makeshwar, A. Kalra, N.S. Rajput, K.P. Singh. Computational Scalability with Apache Flume and Mahout for Large Scale Round the Clock Analysis of Sensor Network Data. National Conference on Recent Advances in Electronics & Computer Engineering, 2015.

32. https://www.ibm.com/analytics/us/en/technology/hadoop/bigsheets/

33. K.S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Eltabakh, C.-C. Kanne, F. Ozcan, E.J. Shekita. Jaql: a scripting language for large scale semistructured data analysis. Proc. VLDB Conf. (2011).


... The main categories are classification, estimation, prediction, clustering, association, and profiling. As we studied in the literature [1][2][3][4][5][6][7][8][9][10][11], various data mining models are proposed for Big Data analytics. The missing data problem is a setback for big data analytics. ...

... A Graph Neural Network (GNN) model was enhanced for the missing data imputation (MDI) application. With the help of GNN, a Graph convolutional autoencoder was designed to reconstruct the complete dataset [5]. The adversarial loss and global information were included in the dataset during the reconstruction phase. ...

  • Pooja Choudhary
  • Kanwal Garg

The big data pattern analysis suffers from incorrect responses due to missing data entries in the real world. Data collected for digital movie platforms like Netflix and intelligent transportation systems is Spatio-temporal data. Extracting the latent and explicit features from this data is a challenge. We present the high dimensional data imputation problem as a higher-order tensor decomposition. The regularized and biased PARAFAC decomposition is proposed to generate the missing data entries. The biases are created and updated by a chaotic exponential factor in Adam's optimization, which reduces the imputation error. This chaotic perturbed exponentially update in the learning rate replaces the fixed learning rate in the bias update by Adam optimization. The idea has experimented with Netflix and traffic datasets from Guangzhou, China.

... Some prominent open-source tools and technologies for data stream analytics include NoSQL [41] , Apache Spark [42][43][44] , Apache Storm [45] , Apache Samza [46,47] , Yahoo! S4 [48] , Photon [49] , Apache Aurora [50] , EsperTech [51] , SAMOA [52] , C-SPARQL [53] , CQELS [54] , ETALIS [55] , SpagoWorld [56] . Some proprietary tools and technologies for streaming data are Cloudet [57] , Sentiment Brand Monitoring [58] , Elastic Streaming Processing Engine [59] , IBM InfoSphere Streams [16,60,61] , Google MillWheel [46] , Infochimps Cloud [56] , Azure Stream [62] , Microsoft Stream Insight [63] , TIBCO StreamBase [64] , Lambda Architecture [6] , IoTSim-Stream [65] , and Apama Stream [62] . ...

... IBM InfoSphere Streams can handle millions of messages or events in a second with high throughput rates, making it one of the leading proprietary solutions for real-time applications [61] . Apama Stream Analytics is suitable for real-time and high-volume business operations [62] . ...

Recent advances in computer networking, smart cities, smart grid, remote sensing, surveillance, telecommunication, and social media have led to a high volume of streaming data. The amount of data generated for the past two years is more than what has been in the history of the entire human race. This high volume, high‐traffic, and brief life‐span data need online analysis and intelligent processing to uncover useful and exciting information that is contained in them. To expand the existing knowledge in the domain of data science, broad areas on streaming data and data streams, which embrace data stream mining issues, streaming data tools and technologies, streaming data pre‐processing, streaming data algorithms, and strategies for processing data streams, were discussed in this article. The article also recommends the best practices for managing data streams and suggests the way forward.

... However, the big data mining approaches found in the literature seldom address the temporality. e.g., [209][210][211][212] propose various social media mining approaches, to analyze and visualize Twitter clusters, to extract customers' opinions on product features, to analyze customer requirements. All these approaches do not touch any time aspects of data nor analytics. ...

The main purpose of this paper is to provide a theoretically grounded discussion on big data mining for customer insights, as well as to identify and describe a research gap due to the shortcomings in the use of the temporal approach in big data analyzes in scientific literature sources. This article adopts two research methods. The first method is the systematic search in bibliographic repositories aimed at identifying the concepts of big data mining for customer insights. This method has been conducted in four steps: search, selection, analysis, and synthesis. The second research method is the bibliographic verification of the obtained results. The verification consisted of querying the Scopus database with previously identified key phrases and then performing trend analysis on the revealed Scopus results. The main contributions of this study are: (1) to organize knowledge on the role of advanced big data analytics (BDA), mainly big data mining in understanding customer behavior; (2) to indicate the importance of the temporal dimension of customer behavior; and (3) to identify an interesting research gap: mining of temporal big data for a complete picture of customers.

... Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data [3]. It has a simple and flexible architecture based on streaming data flows. ...

... The first stage of data collection form social media is to identify related keywords, and then the collection can be actioned from Facebook (Ketter, 2016), (Sitta et al., 2018), Instagram (Kale, 2016), TripAdvisor (Chang et al., 2017), while Twitter appears to be the most used data source for SMA applications (Birjali et al., 2017), (Ahuja & Shakeel, 2017), (Öztürk & Ayvaz, 2018), (Kim et al., 2014), (Saif et al., 2016), (Culotta & Cutler, 2016), (Pournarakis et al., 2017),(X. Zhang et al., 2011), (Goel & Mittal, 2012), (Attigeri et al., 2016). ...

Social media analytics appears as one of recently developing disciplines that helps understand public perception, reaction, and emerging developments. Particularly, pandemics are one of overwhelming phenomena that push public concerns and necessitate serious management. It turned to be a useful tool to understand the thoughts, concerns, needs, expectations of public and individuals, and supports public authorities to take measures for handling pandemics. It can also be used to predict the spread of the virus, spread parameters, and to estimate the number of cases in the future. In this chapter, recent literature on use of social media analytics in pandemic management is overviewed covering all relevant studies on various aspects of pandemic management. It also introduces social media data sources, software, and tools used in the studies, methodologies, and AI techniques including how the results of the analysis are used in pandemic management. Consequently, the chapter drives conclusions out of findings and results of relevant analysis.

... Big data is a condition in which conventional database systems have the inability to enlarge capacity already created. At present, data on conventional database systems is very large and growth is too fast, so the architecture of conventional database systems is not able to deal with these conditions [4] Big data has three characteristics: volume, speed and variety. The organizational context is crucial to an organization's success. ...

The purpose of writing this paper is to look at how factors of big data adoption affect the users of social media and brand popularity. The method used in the writing of this paper is by using path analysis and also by doing literature study to find reference material. The results of the analysis indicate that there is influence of big data adoption factors that affect the popularity of the brand. With this research, we get big data model that influence the brand popularity.

  • Paresh Gupta
  • Shruti Khanduja
  • Suresh Kumar
  • Kamlesh Sharma

In recent years, social media has become the best platform to advertise products or gain popularity in one way or another. The only way to figure out the public demand is by analyzing the advertisements or product launch, which is gaining increased popularity. On various social media sites, there are billions of users who regularly share their opinions. As a result, a vast amount of data is generated. To identify useful patterns from this data, it must be analyzed. Henceforth, this paper discusses the importance of social media data analysis and its benefits to the business. Also, this paper has utilized the customer support dataset from Twitter for further analysis. This research work considers a large, modern corpus of tweets from customers and their replies. The pandas-profiling tool is used to perform the analysis. The overall outcome of this analysis is about the improvements required in the services of different companies which in return if addressed timely, can increase the overall profit of the companies.

  • Barkha Singh
  • Neha Firdaush Raun
  • Nagendra A. Sole

The chapter explores transdisciplinary studies in IoT-based healthcare solutions, their representation through Social media big data. The current situation demands attention to develop socio-cyber spaces from where control and data related to trace, track, and control be possible. The current pandemic has been the subject matter of literary arts and social media. Whenever a lockdown situation happens in a country or the state closure is in an effective mode due to an epidemic or pandemic, the mass writing helps people to be aware of the real-time situation. Nowadays technology is added to this task of the mass writing and explores wisdom regarding the problems. Presently, as Coronavirus is spreading globally, we do not have a choice but to accept government policies to maintain social distancing to control the spread of this virus, but it is not easing for everyone to sit at home in isolation. During this pandemic across the world, social media is a resource for people to stay connected virtually, helps to engage and entertain people, and to spread positivity around. Social media has come to our rescue more than ever and helping us to cope with the quarantine. Applications like WhatsApp, Facebook, Twitter, LinkedIn keep everyone connected with his/her family, friends, and colleagues who are quarantined during the period. Hence the chapter reflects the aspect of cognitive function of Social media. From online meetings to sharing homemade recipes, to online classes and getting an update on COVID-19. Social media compensates our boredom during this pandemic. The authors observed that the compensation caused a psychological displacement in people. For example, the doctors uses social media to educate people about COVID-19. The YouTube live streaming classes for better learning which enhances the growth of a kids, is another example. 
The user information collected on social media platforms allows marketers to have a better understanding of the customer behavior, target audience groups, and engagements. Based on these reviews, the chapter attempts to explore the immaculate use of IOT and Big Data Analytics for healthcare governance.

Sentiment analysis (SA), also called Opinion Mining (OM) is the task of extracting and analyzing people's opinions, sentiments, attitudes, perceptions, etc., towards different entities such as topics, products, and services. The fast evolution of Internet-based applications like websites, social networks, and blogs, leads people to generate enormous heaps of opinions and reviews about products, services, and day-to-day activities. Sentiment analysis poses as a powerful tool for businesses, governments, and researchers to extract and analyze public mood and views, gain business insight, and make better decisions. This paper presents a complete study of sentiment analysis approaches, challenges, and trends, to give researchers a global survey on sentiment analysis and its related fields. The paper presents the applications of sentiment analysis and describes the generic process of this task. Then, it reviews, compares, and investigates the used approaches to have an exhaustive view of their advantages and drawbacks. The challenges of sentiment analysis are discussed next to clarify future directions.

  • Navdeep Bohra
  • Vishal Bhatnagar

Social media has become a tremendous source to bring in new clients. Sharing posts for new offers/products to get extensive client engagement can be predicted by grouping the users based on their previous interactions. In this paper, we improve existing state-of-the-art techniques to predict group-level popularity by extending the data clustering approach and constraint network prediction using stochastic Adam optimization. Various other topological properties of this two-level approach are also tested. The Adam optimization for clustered group prediction improves the relative error substantially. Overall, the proposed novel approach improved the prediction popularity accuracy by a significant difference of 18.21%.

In this paper, a typical scenario has been considered wherein gas sensor array responses from a WAN deployed sensor network are being received hourly, 24×7. From every sensor node, we are retrieving Static as well as Dynamic Responses with 16 sensing elements generating a .csv file of 9 MB size. Considering 1000 sensor nodes, the data received at the Hadoop Cluster at our Data Centre would be about 9 GB, which can be even more if more number of nodes, over larger geographical area and/or higher density of nodes is considered. Hence, (i) to receive and store such a huge data from a sensor network and (ii) to analyse the received data, we explored the suitability of Apache Flume and Apache Mahout to deliver high performance computational scalability on Hadoop Distributed File System. In this work, an implementation methodology for realization of such a scalable system has been presented by considering a sensor network for air pollution observation over a large geographical area, as an example.

Big Data has become a very popular term. It refers to the enormous amount of structured, semi-structured and unstructured data that are exponentially generated by high-performance applications in many domains: biochemistry, genetics, molecular biology, physics, astronomy, business, to mention a few. Since the literature of Big Data has increased significantly in recent years, it becomes necessary to develop an overview of the state-of-the-art in Big Data. This paper aims to provide a comprehensive review of Big Data literature of the last 4 years, to identify the main challenges, areas of application, tools and emergent trends of Big Data. To meet this objective, we have analyzed and classified 457 papers concerning Big Data. This review gives relevant information to practitioners and researchers about the main trends in research and application of Big Data in different technical domains, as well as a reference overview of Big Data tools.

Opinion mining, which extracts meaningful opinion information from large amounts of social multimedia data, has recently arisen as a research area. In particular, opinion mining has been used to understand the true meaning and intent of social networking site users. It requires efficient techniques to collect a large amount of social multimedia data and extract meaningful information from them. Therefore, in this paper, we propose a method to extract sentiment information from various types of unstructured social media text data from social networks by using a parallel Hadoop Distributed File System (HDFS) to save social multimedia data and using MapReduce functions for sentiment analysis. The proposed method has stably performed data gathering and data loading and maintained stable load balancing of memory and CPU resources during data processing by the HDFS system. The proposed MapReduce functions have effectively performed sentiment analysis in the experiments. Finally, the sentiment analysis results of the proposed system are very close to those of manual processes.

In this paper we propose a simple and completely automatic methodology for analyzing sentiment of users in Twitter. Firstly, we built a Twitter corpus by grouping tweets expressing positive and negative polarity through a completely automatic procedure by using only emoticons in tweets. Then, we have built a simple sentiment classifier where an actual stream of tweets from Twitter is processed and its content classified as positive, negative or neutral. The classification is made without the use of any pre-defined polarity lexicon. The lexicon is automatically inferred from the streaming of tweets. Experimental results show that our method reduces human intervention and, consequently, the cost of the whole classification process. We observe that our simple system captures polarity distinctions matching reasonably well the classification done by human judges.

Due to the sheer volume of opinion rich web resources such as discussion forum, review sites , blogs and news corpora available in digital form, much of the current research is focusing on the area of sentiment analysis. People are intended to develop a system that can identify and classify opinion or sentiment as represented in an electronic text. An accurate method for predicting sentiments could enable us, to extract opinions from the internet and predict online customer's preferences, which could prove valuable for economic or marketing research. Till now, there are few different problems predominating in this research community, namely, sentiment classification, feature based classification and handling negations. This paper presents a survey covering the techniques and methods in sentiment analysis and challenges appear in the field.

  • Peter David Turney Peter David Turney

This paper presents a simple unsupervised learning algorithm for classifying reviews as recommended (thumbs up) or not rec- ommended (thumbs down). The classifi- cation of a review is predicted by the average semantic orientation of the phrases in the review that contain adjec- tives or adverbs. A phrase has a positive semantic orientation when it has good as- sociations (e.g., "subtle nuances") and a negative semantic orientation when it has bad associations (e.g., "very cavalier"). In this paper, the semantic orientation of a phrase is calculated as the mutual infor- mation between the given phrase and the word "excellent" minus the mutual information between the given phrase and the word "poor". A review is classified as recommended if the average semantic ori- entation of its phrases is positive. The al- gorithm achieves an average accuracy of 74% when evaluated on 410 reviews from Epinions, sampled from four different domains (reviews of automobiles, banks, movies, and travel destinations). The ac- curacy ranges from 84% for automobile reviews to 66% for movie reviews.

  • Soo-Min Kim
  • Eduard Hovy Eduard Hovy

Identifying sentiments (the affective parts of opinions) is a challenging problem. We present a system that, given a topic, automatically finds the people who hold opinions about that topic and the sentiment of each opinion. The system contains a module for determining word sentiment and another for combining sentiments within a sentence. We experiment with various models of classifying and combining sentiment at word and sentence levels, with promising results.

  • Minqing Hu
  • Bing Liu

Merchants selling products on the Web often ask their customers to review the products that they have purchased and the associated services. As e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. For a popular product, the number of reviews can be in the hundreds or even thousands. This makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. It also makes it difficult for the manufacturer of the product to keep track of and manage customer opinions. For the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. In this research, we aim to mine and to summarize all the customer reviews of a product. This summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. We do not summarize the reviews by selecting a subset of the original sentences or rewriting some of them to capture the main points, as in classic text summarization. Our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. This paper proposes several novel techniques to perform these tasks. Our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques.
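The three steps listed in the abstract above can be illustrated with a deliberately simplified sketch. It does not reproduce the techniques proposed in the paper: feature mining is replaced here by a plain frequency threshold over a given candidate set, and polarity by a tiny hand-written word list. All names, word lists, and thresholds are hypothetical.

```python
from collections import Counter

# Assumed toy opinion lexicon, for illustration only.
POSITIVE = {"great", "good", "sharp", "excellent"}
NEGATIVE = {"poor", "bad", "short", "blurry"}


def mine_features(sentences, candidates, min_support=2):
    """Step 1: keep candidate feature words mentioned in enough sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(set(sentence.lower().split()) & candidates)
    return {feature for feature, n in counts.items() if n >= min_support}


def sentence_polarity(sentence):
    """Step 2: label a sentence by counting lexicon words; None if neutral."""
    words = sentence.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return None


def summarize(sentences, candidates, min_support=2):
    """Step 3: count positive and negative opinion sentences per feature."""
    features = mine_features(sentences, candidates, min_support)
    summary = {f: {"positive": 0, "negative": 0} for f in features}
    for sentence in sentences:
        polarity = sentence_polarity(sentence)
        if polarity is None:
            continue
        for feature in features & set(sentence.lower().split()):
            summary[feature][polarity] += 1
    return summary
```

The output is a per-feature tally rather than a selection of sentences, which mirrors the distinction the abstract draws between this task and classic text summarization.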

  • Marshall Copeland
  • Julian Soh
  • Anthony Puca
  • David Gollob

As you saw in Chapter 1, Microsoft Azure represents computing capabilities. What does that mean? That Azure strives to be the foundation of modern computing and continues to evolve. The services presented are a snapshot in time, and you should expect new services to be introduced at an accelerated pace.