Database Creation and Text Analysis in Services

Our EQS software analyses the text of a service request (e.g. “I am looking for a nursery in Brno which takes children as young as 1 year old”) and presents the user with appropriate suppliers.

To pair requests with suppliers correctly, the search algorithm needs a sufficient amount of data to learn from. We approached this problem by building web crawlers, which gave us the data we needed in Czech from external sources. Above all, however, we are creating our own database of all activities carried out in the Czech Republic.

When creating it, we did not want to limit ourselves to a set of services (for example, the nursery or children’s group mentioned above) or to occupations (teacher, nanny, …); the aim was a complete database of all activities that people can perform. For this reason, we merged these areas and supplemented them with additional activities (for example, “babysitting” or, in more detail, “night-time babysitting”).

The primary input for creating the database was the “National System of Occupations”, which we extended with categories from commercial enquiry servers. In this way, we created database areas or, more precisely, type clusters: teacher/nanny/teacher assistant (= occupation) + nursery/children’s group (= service) + babysitting/children’s programme (= activity). We refer to all of these categories collectively as activities.

All activities were supplemented with keywords typical of them (children to nurseries, babysitting, …). Since our algorithm attaches the highest weight to keywords, they are far more important than the names of the activities themselves. The clusters mentioned above are therefore defined by a set of words associated with the given activity or area of activities.
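
The effect of weighting keywords above activity names can be illustrated with a minimal sketch. It is not the EQS algorithm itself: the `Activity` class, the weights `KEYWORD_WEIGHT` and `NAME_WEIGHT`, and the sample data are all assumptions made for this example.

```python
import re
from dataclasses import dataclass, field

# Hypothetical weights: a keyword hit counts more than a hit in the name.
KEYWORD_WEIGHT = 3.0
NAME_WEIGHT = 1.0


@dataclass
class Activity:
    name: str
    keywords: set = field(default_factory=set)


def tokenize(text):
    """Lowercase the text and return its alphabetic words as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def score(query, activity):
    """Sum the weights of query words found among keywords or name words."""
    words = tokenize(query)
    name_words = tokenize(activity.name)
    keyword_hits = len(words & activity.keywords)
    # Words that are both keyword and name are counted once, as keywords.
    name_hits = len(words & (name_words - activity.keywords))
    return KEYWORD_WEIGHT * keyword_hits + NAME_WEIGHT * name_hits


def rank(query, activities):
    """Return activities ordered from the most to the least relevant."""
    return sorted(activities, key=lambda a: score(query, a), reverse=True)


activities = [
    Activity("Nursery", {"children", "nursery", "toddlers"}),
    Activity("Babysitting", {"babysitting", "children", "evening"}),
    Activity("Car repair", {"car", "engine", "brakes"}),
]
query = "I am looking for a nursery in Brno which takes children as young as 1 year old"
best = rank(query, activities)[0]
print(best.name)
```

With these sample weights, an activity whose keywords cover more of the request outranks one that merely shares a word with it.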

When aggregating keywords, we used both automatically generated and manually compiled databases and, last but not least, our own descriptions and the suggestions of suppliers whom we have been calling over the last 6 months to offer them a free presentation of their services on our test portal mojilidi.cz.

The primary database was then published on the above-mentioned website, and we began to face real operation. Those interested in presenting their services, from any area, entered a description of their activity into the search bar, for example, “We are running a children’s group in Brno which specialises in ABA therapy, speech therapy or exercising with kids.”

In response to the analysis of the input text, users are presented with the activities identified as most relevant (for example, Children’s Group, Night Babysitting, Babysitting or, wrongly, Nutrition Therapist).

Users can either register under an existing activity, edit or add keywords and a description of their services, or add a completely new activity.

Adding a new activity is subject to confirmation by an administrator, so that no duplicate values arise, such as “taking care of a kid” / “taking care of kids”. Given the principles of the search algorithm, adding similar new activities would not be a problem. However, to keep the statistics clean and to let users register under the existing activity with the most relevant keywords, we try to approve only entirely new activities.
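
An administrator's duplicate check could be supported by a small helper that flags near-identical names. The sketch below is only an illustration: the post does not describe how duplicates are detected, so the functions `normalize` and `is_probable_duplicate`, the crude singular-form rule, the stop-word list and the 0.85 threshold are all assumptions.

```python
import re
from difflib import SequenceMatcher

# Hypothetical stop words ignored when comparing activity names.
STOP_WORDS = {"a", "an", "the", "of"}
IRREGULAR_PLURALS = {"children": "child"}


def singular(word):
    """Very crude singular form: lookup table, then strip a trailing 's'."""
    if word in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[word]
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(name):
    """Lowercase, drop stop words, and reduce words to a crude singular."""
    words = re.findall(r"[a-z]+", name.lower())
    return " ".join(singular(w) for w in words if w not in STOP_WORDS)


def is_probable_duplicate(new_name, existing_names, threshold=0.85):
    """Return the first existing name that closely matches, or None."""
    candidate = normalize(new_name)
    for existing in existing_names:
        ratio = SequenceMatcher(None, candidate, normalize(existing)).ratio()
        if ratio >= threshold:
            return existing
    return None


existing = ["taking care of kids", "car repair"]
print(is_probable_duplicate("taking care of a kid", existing))
print(is_probable_duplicate("text proofreading", existing))
```

A helper like this would only pre-sort the proposals; the final decision stays with the administrator.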

In this way, we have been extending our own database for over a year. Activities with a higher number of users have a database of the most interesting keywords and phrases, which are recommended to users right at registration.

We also track which (and how many) activities include a particular keyword (see the keywords listed above). Furthermore, we track the number of competitors in individual regions. By comparing the results of telephone calls with users against an analysis of new customers acquired through advertising in particular areas (for example, we are finding that car repair shops are not interested in registering, whereas text proofreaders are highly interested), we are building up interesting statistics about individual market segments.
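
Tracking which activities include a particular keyword amounts to keeping an inverted index from keywords to activities. A minimal sketch, assuming a simple in-memory dictionary (the sample activities, their keywords and the function name `build_keyword_index` are made up for illustration):

```python
from collections import defaultdict

# Hypothetical sample of activities and their keywords.
activity_keywords = {
    "Children's Group": {"children", "babysitting", "speech therapy"},
    "Night Babysitting": {"children", "babysitting", "night"},
    "Project Architect": {"building design", "blueprints"},
}


def build_keyword_index(activities):
    """Map each keyword to the set of activities that list it."""
    index = defaultdict(set)
    for activity, keywords in activities.items():
        for keyword in keywords:
            index[keyword].add(activity)
    return index


index = build_keyword_index(activity_keywords)
# How many activities include each keyword, most widely shared first.
counts = sorted(((len(acts), kw) for kw, acts in index.items()), reverse=True)
for count, keyword in counts:
    print(f"{keyword}: {count}")
```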

The picture relates to the Project Architect activity.

The database is constantly growing and being updated as the number of users increases. Likewise, our search algorithm keeps improving and offers more relevant results. In the next article, we will present how we translated our database into English and German and what interesting features this made possible.

Jiří Fuchs

Suitable representatives for a set of reviews

As we mentioned in the previous post, our team is working on a project to help you decide which products and services to buy. We try to help users form an objective view of the specific items they want to buy by analyzing reviews published by other users. We have now downloaded enough reviews and product articles in Czech and English to analyze individual opinions. In the first phase, it was necessary to convert the obtained texts into a form suitable for analysis.

First, we had to divide the documents into individual sentences, because users often present several ideas and evaluate several criteria in one document. The next step was to remove insignificant words that carry little or no information value, for example conjunctions, prepositions, web addresses, and so on. In this step, we also used our own POS analyzer, which assigns a part of speech to each word in a sentence, and our own dataset of stop words. Nouns, adjectives and verbs were of particular interest to us. Subsequently, we reduced the words to their base forms by determining their roots.
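
The preprocessing steps above can be sketched roughly as follows. This is a simplified illustration for English text only: the tiny stop-word list and the suffix-stripping `stem` function are stand-ins for the project's own stop-word dataset and stemming, and the POS analyzer is not reproduced at all.

```python
import re

# Tiny illustrative stop-word list; the project uses its own, larger dataset.
STOP_WORDS = {"the", "a", "an", "and", "but", "is", "are", "it",
              "of", "in", "on", "to", "with", "for"}
URL_PATTERN = re.compile(r"(https?://|www\.)\S+")
SUFFIXES = ("ing", "ed", "es", "s")  # crude stand-in for real stemming


def split_sentences(document):
    """Split on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(document):
    """Return one list of stemmed content words per sentence."""
    result = []
    for sentence in split_sentences(document):
        # Remove web addresses before extracting the words.
        sentence = URL_PATTERN.sub(" ", sentence.lower())
        words = re.findall(r"[a-z]+", sentence)
        result.append([stem(w) for w in words if w not in STOP_WORDS])
    return result


review = "The battery lasts two days. Charging is slow, see www.example.com for details!"
print(preprocess(review))
```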

We transformed the edited documents into vectors using tf-idf and then split them into clusters sharing the same theme using the k-means method. We managed to identify clusters that are reasonably distinct from one another and have a high degree of internal coherence. The identified topics related to the main parameters of the product segment surveyed.
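
To make the two steps concrete, here is a small self-contained sketch of tf-idf weighting followed by clustering. It is not the project's code. It assumes already tokenized documents, uses one common smoothed idf variant (`log(n / df) + 1`), and swaps in a cosine-based (spherical) variant of k-means with a deterministic farthest-point initialization so that the toy example is reproducible.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Turn tokenized documents into L2-normalized tf-idf dictionaries."""
    n = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vec = {t: (c / len(doc)) * (math.log(n / doc_freq[t]) + 1.0)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Dot product of two sparse vectors (cosine if both are normalized)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def mean_vector(vectors):
    """Sum sparse vectors and re-normalize the result to unit length."""
    total = Counter()
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}


def kmeans(vectors, k, iterations=10):
    """Spherical k-means with deterministic farthest-point initialization."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Pick the vector least similar to every centroid chosen so far.
        farthest = min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        centroids.append(farthest)
    labels = []
    for _ in range(iterations):
        labels = [max(range(k), key=lambda i: cosine(v, centroids[i])) for v in vectors]
        for i in range(k):
            members = [v for v, label in zip(vectors, labels) if label == i]
            if members:
                centroids[i] = mean_vector(members)
    return labels, centroids


docs = [
    ["battery", "last", "long"],
    ["battery", "charg", "slow"],
    ["screen", "bright", "sharp"],
    ["screen", "color", "bright"],
]
vectors = tfidf_vectors(docs)
labels, centroids = kmeans(vectors, k=2)
print(labels)
```

In this toy input the two battery sentences end up in one cluster and the two screen sentences in the other.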

The clusters created for the whole segment, based on expert articles, were then used to classify product reviews. From the clusters identified for individual reviews, we chose those with the highest predictive value and present them as suitable representatives of the given set of reviews. The result of the analysis is shown in the example below.

Jan Přichystal

We have downloaded over half a million user reviews

Here we report on the milestones we have reached in analyzing text reviews. Let’s take a look at our research.

Motivation

Our team is currently working on a project to help people decide which products to buy. A huge number of opinions and reviews of individual products can be found on the Internet.
These user reviews are scattered across a variety of discussion forums, product rating sites, and specialized portals. For a regular user, it is difficult to find the information needed, get an overview of it, and form their own opinion.

Methodology

In order to analyze large amounts of unstructured data, we have decided to use machine learning methods. We want to use these data to identify topics that are important to users and to determine their positive or negative attitudes towards individual product features.

Current status

We are currently building crawlers to download user reviews and articles about the selected product group. Each crawler is tailored to the structure of a specific site and extracts the relevant data that can help in analyzing topics and attitudes. So far, we have created eight crawlers, which have helped us download about half a million user reviews and expert articles on two thousand products in two languages (Czech and English).

Problems solved

We had to deal with several issues when acquiring the data. One of the main ones is that different sites label products differently: even for an identical product, the names differ, which complicates product pairing. Another problem is that some sites limit the number of accesses by means of a CAPTCHA. The last issue still to be solved is changing site structure, which causes crawlers to fail.
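
One possible way to pair differently labeled products is to normalize the names and compare them with a string similarity measure. The sketch below is an assumption about how this could be done, not a description of the project's solution: the function names, the sample product names and the 0.8 threshold are all made up for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize_name(name):
    """Lowercase, strip punctuation, and sort tokens to ignore word order."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(tokens))


def similarity(name_a, name_b):
    """Similarity in [0, 1] between two normalized product names."""
    return SequenceMatcher(None, normalize_name(name_a), normalize_name(name_b)).ratio()


def pair_products(site_a, site_b, threshold=0.8):
    """Pair each name from site_a with its best match from site_b, if any."""
    pairs = {}
    for name_a in site_a:
        best = max(site_b, key=lambda name_b: similarity(name_a, name_b))
        # Keep the pair only if the best candidate is similar enough.
        if similarity(name_a, best) >= threshold:
            pairs[name_a] = best
    return pairs


site_a = ["Samsung Galaxy S9 64GB", "Apple iPhone 8", "Sony Xperia XZ2"]
site_b = ["iPhone 8 (Apple)", "Galaxy S9 64 GB Samsung", "Nokia 3310"]
pairs = pair_products(site_a, site_b)
print(pairs)
```

Names without a sufficiently similar counterpart stay unpaired, so they can be reviewed by hand rather than matched incorrectly.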

Conclusion

We have practically completed the first phase of the project, in which our task was to create data acquisition tools for the subsequent analysis. In the next phase, we will use machine learning methods to uncover the topics discussed and the attitudes of users.

Jan Přichystal

Text analysis in the field of business mediation

Public and private customers increase their online spending every year. As new generations of buyers mature, suppliers face more and more online demand for goods and services.

On the Czech market, there are approximately 20 portals that match demands with suitable suppliers. In exchange for contact details and a description of the demand, customers are promised qualified, relevant suppliers with proper experience and references. This is a more advanced service than looking for a suitable supplier with web search engines such as google.com or seznam.cz, but a less complex one than online auction portals, which remain the domain of the purchasing departments of enterprise companies.

Business mediation portals

The services of mediation portals are free for buyers, yet suppliers are usually asked for various payments.

Public customers are obliged to conduct public tenders in order to find a suitable supplier. In this case, suppliers are charged monthly or yearly fees for up-to-date information about new tenders that appear on thousands of public-body profiles. Such a service sends notifications about the description and subject of a tender to suppliers whose profile, filled in during registration, matches it.

The weaknesses of such a service are obvious: if the description and subject of a public tender do not match the profile, the supplier gets no notification. Suppliers therefore miss important opportunities when the relevant description is provided only in the attachments and documentation, which is mostly the case with more complex deals.

In the case of private demands, suppliers are asked to pay for each demand they are interested in. In practice, this means that suppliers are overwhelmed by small inquiries that require payment for contact disclosure.

Mediation by Artificial Intelligence

The state of the market remains suboptimal: customers do not get information on the most relevant suppliers (as their demand is shown only to paying ones), and suppliers miss some opportunities and also have to handpick the demands relevant to their services.

Artificial intelligence that would be able to understand:

  • texts of suppliers’ offers taken from their websites, product information and descriptions of services in marketing materials or presentations, and
  • texts of inquiries taken from procurement documentation or simply written by customers

has the potential to bring significant innovation to the field. The added value may include:

  • weighted match of inquiry profile and supplier profiles
  • more relevant customer demand and supplier references match
  • checks of qualification criteria for public tenders
  • reduction of administration during registration processes
  • supplier matches that are independent of payments from suppliers
  • valuable market research information

Artificial Intelligence offering

Regardless of the potential benefits, our market research showed that mediation portal operators have no interest in such innovation in their field.

Our research also showed that mediation portal operators are mostly anonymous or unreachable, which is understandable: their services do not satisfy many of their clients, even though most businesses have been offered such services (often by several mediation portals).

Of the mediation portal operators we reached, none develops its business through product development; the preferred route is customer acquisition.

Radomír Věntus

The Reviews section launch!

In the introductory article of the Reviews section, we present the Multicriterial Text Analysis Software (MTA) project, which deals with the removal of information asymmetries in news and reviews.

The MTA team of scientists from CYRRUS ADVISORY, a.s. and Mendel University in Brno uses machine learning methods to analyze text in the field of current news and product reviews.

In the following posts, you can look forward to descriptions of the research issues and the results we have already achieved in this area.

Jiří Fuchs