Diego Klabjan

CUSTOMER BEHAVIOR FROM WEB AND TEXT DATA

1/21/2015

Many sites and portals offer text content on web pages. For example, news aggregators such as The Huffington Post or Google News allow users to browse news stories; membership-based portals focusing on a specific industry, e.g., constructionsupport.terex.com for construction, offer members a one-stop page for the latest updates in a particular domain; and in the service domain, DirectEmployers provides the site www.my.jobs with job listings for members to explore. A challenge faced by these site providers is to distinguish users who simply browse the site from those who are actively searching with an end goal. For DirectEmployers, this means distinguishing users who are actively seeking a job from those who are only exploring the portal. The active seekers can then be targeted with marketing campaigns that provide higher business value.

Traditionally this has been attempted through web analytics that follow page views without considering the actual textual content on the pages. This is no longer satisfactory, because modern sites use HTML5, which enables data collection on users' interactions with the textual content. By recording user clicks in JavaScript, new data streams combine user IDs with click streams and the text content viewed. For example, DirectEmployers records the user ID and the job description viewed. Conceptually, this should enable the company to identify which users are merely browsing the portal and which are actively searching for a job.

To achieve this, relevant information first needs to be extracted from each text description. Next, a measure of proximity between two pieces of extracted information is needed. Finally, a single 'dispersion' metric is computed for each user. The higher the metric, the more exploratory the user's behavior. The workflow requires substantial data science and engineering using several tools.

Hadoop's schema-on-read approach is well suited for the bulk of the analysis. Because data can be loaded without defining a schema up front, textual descriptions, click data, and user information can simply be dumped to the filesystem.

To extract relevant information from each text description, Latent Dirichlet Allocation (LDA) can be performed. The process starts by removing stop words from the text, which is easily accomplished in Hadoop's MapReduce paradigm. Instead of raw Java, the scripting language Pig can be used in combination with user-defined functions (UDFs) to accomplish this task in a few lines of code. Next, the document-term matrix is constructed. This is again simple to perform in Pig with a single pass through the text descriptions while fully exploiting concurrency.
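To make the two preprocessing steps concrete, here is a minimal single-machine sketch in plain Python rather than Pig. It is only an illustration under stated assumptions: the stop-word list, the function names `tokenize` and `document_term_matrix`, and the sample descriptions are all hypothetical and not taken from the DirectEmployers pipeline.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real pipeline would use a much larger one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def document_term_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

# Made-up text descriptions used only for illustration.
docs = [
    "Experience with the design of bridges and roads",
    "Design and testing of software for the web",
]
vocabulary, matrix = document_term_matrix(docs)
```

Each document is processed independently before the vocabulary is merged, which is the property that lets the same logic run concurrently as a map step.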

LDA, which takes the document-term matrix as input, is hard to execute in Hadoop's MapReduce framework. It is therefore more common to export the matrix and perform LDA in either R or Python, since both offer excellent support for LDA. The resulting topics, mapped back to the original text content, can then be exported back to Hadoop for the subsequent steps.
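As one possible way to run this step in Python, the sketch below uses scikit-learn's `LatentDirichletAllocation`. The choice of library, the toy matrix, and the number of topics are assumptions made for illustration; the post does not say which package or settings were used.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term matrix (assumed data): rows are text descriptions,
# columns are term counts.
doc_term = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 1],
    [0, 0, 4, 2],
    [0, 1, 3, 3],
])

# n_components is the number of topics; 2 is an illustrative choice.
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# fit_transform returns one topic distribution per description.
doc_topics = lda.fit_transform(doc_term)
```

Each row of `doc_topics` is the topic mixture for one description, which is the representation that would be exported back to Hadoop.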

The distances between text descriptions, based on the topics provided by LDA, can be computed efficiently in Hadoop by using a self-join in Pig with the help of UDFs. Finally, the score for each user is computed by joining user data with clicks and the pairwise distances between text descriptions.
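The post does not specify the distance measure or the exact form of the score, so the following sketch makes two assumptions: it uses the Hellinger distance between topic distributions, and it defines a user's dispersion as the mean pairwise distance over the distinct descriptions that user viewed. The topic vectors, IDs, and function names are hypothetical.

```python
from itertools import combinations
from math import sqrt

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (range 0 to 1)."""
    return sqrt(0.5 * sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q)))

def dispersion(viewed_ids, topics):
    """Mean pairwise distance over the distinct descriptions a user viewed."""
    pairs = list(combinations(sorted(set(viewed_ids)), 2))
    if not pairs:
        return 0.0  # fewer than two distinct descriptions: no spread to measure
    return sum(hellinger(topics[a], topics[b]) for a, b in pairs) / len(pairs)

# Assumed topic distributions per description, e.g. as produced by LDA.
topics = {
    "job1": [0.9, 0.1],
    "job2": [0.8, 0.2],
    "job3": [0.1, 0.9],
}

# Assumed click data: user ID mapped to the descriptions viewed.
clicks = {
    "focused_user": ["job1", "job2"],
    "explorer": ["job1", "job2", "job3"],
}

scores = {user: dispersion(ids, topics) for user, ids in clicks.items()}
```

Under these assumptions the user who views topically similar descriptions receives a lower score than the one who views descriptions spread across topics. In Pig, enumerating the pairs corresponds to the self-join and `hellinger` would live in a UDF.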

All of these steps can be accomplished in Pig (and select steps are more elegant in Hive) with only a limited amount of Java code hidden in UDFs and the assistance of R or Python.

Without Hadoop's ability to handle the size and variety of the data, this analysis would be confined to user clicks alone, and the value provided would be very limited.


    Diego Klabjan

    Professor at Northwestern University, Department of Industrial Engineering and Management Sciences. Founding Director, Master of Science in Analytics.
