Detecting information transparency in the Italian real estate market: a machine learning approach

This research aims to understand how market transparency and data reliability can influence valuation procedures and decision-making processes in the Italian real estate market. Through the analysis of three different real estate markets and the validation of the information collected, this paper's goal is to understand whether and to what extent the use of asking prices instead of actual purchase and sale prices can lead to valuation errors, increase the uncertainty of valuation, and undermine investment decision-making processes. The research results highlight the primary sources of information opacity in the Italian real estate market, classifying them according to their impact on real estate value. The novelty of this research lies in the integrated use of machine learning techniques, computer programming and multi-parametric valuation procedures to understand and manage information opacity in the Italian real estate market, particularly regarding the estimation of the market value of properties belonging to the residential segment.


INTRODUCTION
Market transparency in a given market is related to the accessibility and availability, as well as the quality and reliability, of data and information. Transparency is therefore one of the prerequisites of a perfectly competitive market and of its equilibrium condition. It translates into perfect information for market agents (in particular consumers) on the prices and characteristics of goods or services. In a transparent market, good-quality data are available and accessible to both demand and supply. In such a case, agents can make purchase or production decisions based on the available data without requiring a high expenditure of resources for their collection. This assumption is crucial for the proper functioning of the market and the achievement of its efficiency. In contrast, if such data are either unavailable or very poor, the market is defined as opaque.
According to Schulte et al. (2005): "Real estate markets can be described as transparent when it becomes clear how the market mechanisms and the variables behind these mechanisms work, i.e., when there is as much information as possible available at any point in time." Jones Lang LaSalle (2022) (hereafter JLL) defines a transparent real estate market as an open and clearly organised market that operates within a legal and regulatory framework characterised by a consistent approach to applying rules and regulations.
These forecasting models reflect the "opacity of the market" because they are trained on raw data (opaque information), and selling ads, as will be further discussed, always contain incomplete, misleading, or even wrong information. Subsequently, the databases are re-downloaded by hand. This procedure allows for checking the correctness of all the information claimed in the selling ads, producing more transparent databases. The ANNs are then implemented on these corrected databases, allowing us to understand how much information opacity influences the market value assessment. Besides, the primary sources of opacity are isolated and analysed. Section 2 will introduce and clarify some concepts that will be useful for the purpose of this analysis. Section 3 will then present the method adopted to discuss information opacity in the Italian real estate market, while Section 4 will introduce the three practical case studies object of this research. Section 5 will present the ANN training and testing procedures, and Section 6 will illustrate the impact of the variables on the market value forecast. Finally, Section 7 will discuss the conclusions of this work.

Transparency in real estate market
According to transparency theory, there is strict transparency in the market when information sharing occurs with a high degree of equivalence between market participants (supply and demand), while a lack of transparency causes information asymmetry in the market (Yun and Chau, 2013). Information asymmetry occurs when some market participants are more informed than others concerning the transaction information (Akerlof, 1970), thus leading to distorted outcomes, increased transaction costs, and increased risk. The concept of transparency is conceived differently in the real estate market and is linked to several aspects. Schulte et al. (2005) link it to the possibility of obtaining information on the real estate market and the submarkets into which it is divided. For Linquivitz (2012), transparency mainly concerns real estate transactions and five related aspects: legal information, financing, taxation, transaction costs and data. Other authors directly connect the level of transparency to the technological development pervading the real estate market. Transparency requires open access to new and granular information with extensive geographical coverage but, at the same time, entails complete integrity and accuracy of sources (Ionașcu and Anghel, 2020). In real estate, a more transparent market is believed to attract more investments and investors (Razali and Adnan, 2012). More specifically, the higher the level of real estate transparency (RET), the higher the participation and number of foreign investors in real estate (FREI). The JLL's Global Real Estate Transparency Index (GRETI) 2022 ranks 94 Countries and 156 Cities according to market transparency. Since 1999, this index has considered several parameters, including investment performance measurement, market fundamentals, governance of listed vehicles, regulatory and legal environment, transaction processes, and sustainability transparency. In addition, the investment performance measurements
include property valuations. Highly transparent markets are advancing thanks to technology, climate action, capital markets diversification and regulatory changes. The index also marks a growing divergence between the leading markets and other, more opaque ones, with many stalling or falling back in transparency growth. Sustainability was the main driver of transparency growth in the GRETI 2022 index, with many countries adopting mandatory energy efficiency and emission standards for buildings. From the digitalisation perspective, the availability of new high-frequency and specific data is increasing the transparency of the real estate sector, leading to a greater understanding of how markets and buildings work. In addition to JLL's reports, in recent years the relationships between transparency and property market activities, such as investment, property development and valuation, have been examined in several papers (e.g. Farzanegan and Gholipour, 2014; Gholipour and Masron, 2013; Newell, 2016). The literature agrees that the problem of transparency in different countries can make real estate valuation, the ability to identify suitable investments, and the dialogue between different market players difficult. Moreover, high market transparency is the first line of defence against uncertainty. Nevertheless, real estate transparency is not an unambiguous concept. This paper aims to discuss how information opacity still represents a crucial problem in the Italian real estate market, specifically focusing on access to information in the valuation of the property market value. For a deep understanding of the transparency level in the Italian real estate market and of the possible error that would be made in real estate valuations due to a lack of transparency, a model for defining and "measuring" the opacity of the market is proposed below. To this extent, market transparency is here investigated with the help of machine learning techniques, and a set of Artificial Neural Networks (ANNs) is therefore
developed with the aim of identifying the major sources of opacity in the market. In particular, the ANNs are intended as a multi-parametric forecasting tool able to estimate the market value of a property as a function of several building characteristics. The ANNs are trained on databases downloaded from specific selling websites with the help of an automated downloading procedure developed ad hoc in the Python® computer language.
Detecting information transparency in the Italian real estate market: a machine learning approach. valori e valutazioni, No. 31, 2022.
RET has also been investigated in relation to the default on mortgages, DOM (Gholipour et al., 2020). Due to the globalisation of property transactions and the presence of foreign investors, the demand for (better) information on real estate data has increased significantly (Farzanegan and Fereidouni, 2014), but this does not apply to every market, as stressed by the last report of JLL (2022).
Each real estate market shows its specific level of transparency, and professionals in the field of property valuation must cope with the different quality and availability of the data sources they can use. If a professional operates in an opaque market, she/he must base her/his appraisal judgements on poor-quality data, leading to less reliable valuations. If the same professional operates in a transparent market, she/he can rely on numerous and detailed pieces of information, producing a more robust appraisal. Transparency in real estate has to be analysed with respect to the peculiarities of the sector, which determine the different functioning of the real estate market compared to any other market (Arnott, 1987). Besides, in the field of property investment, market opacity caused by a lack of detailed information on asset characteristics and prices leads to an inefficient allocation of resources and also increases investment risk. This is related to the problem of information asymmetry, which involves both the consumers, who cannot know all the legal, technical, and economic aspects of the asset they are willing to buy, and the investors, who approach the investment with higher risk. In addition, market opacity makes property valuations more uncertain and prevents buyers from exercising control over the price-quality ratio of the properties they intend to purchase (Guerrieri, 2011). A significant problem regarding market opacity, market asymmetry, and the availability of information concerns real estate valuation in Italy.
Italian valuation has faced international standards since the introduction of the first property investment funds and international listed companies in the late 1990s. Until then, an alignment with other countries had not been necessary, but the arrival of international investors required further development in the Italian real estate and valuation sectors. Indeed, reform was needed in some respects, especially to codify the standard contents of a valuation report and clarify the steps leading to the estimation of the property value. The International Standards translated and introduced in Italy since the 2000s have sought to identify shared languages and procedures, best practices and market information analysis, and the contents and processes of a real estate appraisal. However, valuation methods and approaches were of less concern, as they were already used in the Italian appraisal sector. What required more effort was the access to, and the use of, the data and information valuations were based on. Adopting International Standards was impossible without first establishing the rules for collecting and analysing market data. Undoubtedly, during the 2000s, in connection with the development of real estate funds and solid market growth, there was a marked improvement in the quality and quantity of economic information.

Price versus value
For the purposes of this research, i.e. discussing information transparency in the Italian real estate market, it is crucial to distinguish between two key concepts that belong to the field of property valuation: price and value.
As thoroughly explained in the literature (French et al., 2021; Forte and De Rossi, 1974; International Valuation Standards, 2020), the value estimated by a professional may differ from the price at which a property is finally traded. The reasons may reside on several grounds: some may certainly be under the valuer's control, but others are not. In fact, the value of an asset may be overestimated or underestimated, for example, due to insufficient analysis, misleading personal perceptions, prior assumptions, or data misinterpretation, which is (in a way) indeed the valuer's responsibility. However, most of the causes of low valuation accuracy might be out of the professional's control.
This paper illustrates that the first cause of low valuation accuracy is a professional operating in a market with high information opacity. Since information transparency/opacity is related to data availability and data correctness in a specific market, valuation accuracy and variability highly depend on the reliability of the information provided.

Comparison and comparables
Market transparency is directly related to both the availability and the correctness of the comparables, in all their forms, that a professional can rely upon. In the Italian valuation discipline, one of the principles of real estate valuation states that the method is unique and is founded on comparison (Forte and De Rossi, 1974). Even in the international scenario, comparison constitutes, in fact, the basis of all the valuation approaches recognised by international standards, i.e. the Market (or Sales) Comparison Approach, the Income Approach and the Cost Approach (Simonotti, 2006).
Value assessment procedures build up a comparison between the property being valued and other comparables, where a comparable can be defined as a property similar to the object of valuation, whose price/income/cost is known and which belongs to the same submarket. In addition, the comparable's characteristics should be similar to those of the object of valuation in terms of intrinsic and extrinsic features, such as, to name a few, location, maintenance level, size, area or building typology.
For this reason, the data of the properties used in the comparison are a crucial element in the process of determining the market value, whether they are economic facts (prices, costs) or technical data (quantitative and qualitative characteristics). Based on the quality and quantity of the available data, the valuer will choose the most appropriate approach and method to determine the value (French and Gabrielli, 2018). Usually, soft information is also presented in the form of aggregated data.
In the Italian real estate valuation system, this dual classification between hard and soft information could be identified with the terms "direct sources" and "indirect sources", even if there is no total correspondence between these two classifications. Direct sources also include the asking prices contained in sale advertisements and listings. On the contrary, in the TEGOVA report, asking prices are classified as soft information, together with indirect sources. This difference in data source classification between countries is related to the availability of information, local legislation, and the degree of market transparency.
The adequacy of comparables is not universal (as JLL's Index shows), nor can they be used in the same way even within the same country (Ionașcu et al., 2021), as in the case of the different data available in large cities versus small cities. This implies that international valuation standards should not be overly prescriptive in codifying the appropriate use of comparables, as each data source may play different roles depending on the transparency of the market (Sadayuki et al., 2019).

Relying on asking prices
Sellers indicate the asking price in the hope of attracting potential buyers willing to pay it. This price is identified through valuations, market analysis, or real estate agents. It is set at the beginning of the sale negotiation and is not always a definitive price, being subject to a margin of negotiation in most cases.
Only the sale price is placed at the end of the negotiation and it is a historical, unchangeable figure.
The two prices are, therefore, diachronic. Scholars have repeatedly used asking prices for research related to property market analysis and property price modelling (Pozo, 2009; Hayunga and Pace, 2016; Gordon and Winkler, 2016). Moreover, some studies have related and compared asking prices with sale prices in real estate markets (Anglin et al., 2003; Beracha and Seiler, 2014; Knight, 2002). Curto et al. (2015) illustrated how the use of asking prices for property valuation is typical in Italy due to the lack of data on sale prices. The same study notes that there could be problems in using such values in property valuation. An animated debate took place among scholars and real estate market professionals in the summer of 2017 on the online Italian magazine "Monitor Immobiliare" (https://www.monitorimmobiliare.it/), debating whether asking prices should be used and when. Opinions do not converge: many professionals assert that the use of asking prices is necessary when sale prices are unavailable. Others reject the employment of asking prices because they cannot reflect the market value of the real estate, while others accept a form of mitigation.
There is much distance between what is written in manuals, rules and regulations and the reality of the real estate market, so operators must find the forms of adaptation necessary for their professional "survival". In Italy, despite greater market transparency and the construction of historical databases, access to purchase and sale data is still limited, meagre or hugely expensive. Also, there is a common misconception that the use of historical price information implies that investors are not pricing the future. This is not the case: an investor pays an agreed figure for the expected future returns, and thus the price captures this view of the future. Valuations are proxies for price and thus do the same: they reflect current market expectations of the future. The valuation is, therefore, the best estimate of the (future) sale price at the date of the valuation. This value must consider the particular real estate cycle, demand trends and the sustainability of this value over the extended period. Furthermore, in order to do this, adequate information is required.
Italian professionals often use asking prices as comparables. Asking prices are easily available from different online real estate marketplaces and they are cost-free. Following the suggestions of international standards, norms and regulations have recommended which data should be used in property valuations. The Italian standards (UNI 11612:2015) advocate that, in the case of valuations in which insufficient, undetectable and/or unreliable transactions have taken place in a recent period, asking prices may be taken into account only on a residual basis. What mistake do valuers make if they use asking prices instead of historical transactions? What are the most significant sources of opacity in the market?

A METHODOLOGICAL APPROACH
The methodology adopted in this research integrates computer programming procedures, machine learning techniques and multi-parametric real estate analysis, with the aim of understanding how, and to what extent, the opacity of the Italian market affects the forecasts of market values. For a given market (case study), the following steps are defined.
• In the first step, automated crawling software is developed in the Python® language. A web crawler is a piece of software able to browse and analyse the contents of web pages in a methodical and automated way. A web crawler is created for this research in order to parse specific Italian selling websites and automatically download the asking prices of the real estate properties currently on sale. Several building characteristics are also downloaded via the same web crawler alongside the asking prices. This procedure allows one to download thousands of observations easily. Each observation represents a property on sale whose asking price and characteristics are known. The database collected via the web crawler for a specific market is generically named DBcrawler.
• In the second step, an Artificial Neural Network is developed based on the DBcrawler, and is therefore called ANNcrawler, in order to forecast the value of a premise depending on its characteristics in the given market.
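The crawling loop of the first step can be sketched as follows with the Python standard library alone. The website address, query parameters and page count are hypothetical assumptions for illustration; the Authors' actual crawler targets specific Italian selling websites and, as discussed later, also relies on the Beautiful Soup library.

```python
import urllib.request
from urllib.parse import urlencode

# Hypothetical search domain: the real website and its query fields differ.
BASE = "https://www.example-listings.it/search"

def build_search_url(city: str, page: int) -> str:
    """Compose the search-result page URL for one city domain."""
    return f"{BASE}?{urlencode({'city': city, 'contract': 'sale', 'pag': page})}"

def fetch_page(url: str) -> str:
    """Download the raw HTML of one page (not invoked in this sketch)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")

def crawl(city: str, n_pages: int) -> list[str]:
    """Collect the HTML of every search-result page for a city domain."""
    return [fetch_page(build_search_url(city, p)) for p in range(1, n_pages + 1)]
```

In practice, the HTML returned by `crawl` would then be parsed to extract each ad's asking price and building characteristics, forming the DBcrawler records.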
To this extent, in the field of machine learning, ANNs are computational systems able to learn patterns and procedures (Ćetković et al., 2018). Neural networks are made of artificial neurons and artificial synapses: the neurons are the computational units, while the synapses connect the neurons to each other. Artificial neurons are organised into multiple separate layers, so that the input neurons are displayed in the input layer and the output neurons are contained in the output layer, while, in between, there are multiple hidden layers of neurons. As in all machine learning procedures, a network can "learn" how specific input information flows from the input neurons and generates an output.
A network can be trained on any dataset of input-output information, consequently building a predictive model (Pittarello et al., 2021). In this paper, the input neurons of the ANNcrawler are some selected characteristics of the premise, while the output neuron is its corresponding market value.
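As a concrete illustration of such a predictive model, the sketch below trains a tiny one-hidden-layer network by gradient descent on fabricated data (two building characteristics predicting a synthetic price). The architecture, data, pricing rule and learning rate are illustrative assumptions and do not reproduce the Authors' networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for DBcrawler: inputs are two building characteristics
# (floor area in m2, number of rooms); the output is a fabricated price.
X = rng.uniform([40, 1], [200, 6], size=(200, 2))
y = (2500 * X[:, 0] + 15000 * X[:, 1])[:, None]  # invented pricing rule

# Standardise inputs and outputs so gradient descent behaves well.
Xn = (X - X.mean(0)) / X.std(0)
yn = (y - y.mean()) / y.std()

# One hidden layer of 8 neurons with tanh activation.
W1 = rng.normal(0, 0.5, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 1)); b2 = np.zeros(1)

lr, losses = 0.05, []
for _ in range(500):
    H = np.tanh(Xn @ W1 + b1)            # hidden-layer activations
    pred = H @ W2 + b2                   # forecast market value
    err = pred - yn                      # error signal per observation
    losses.append(float((err ** 2).mean()))
    # Backpropagation of the mean squared error.
    gW2 = H.T @ err / len(Xn); gb2 = err.mean(0)
    dH = (err @ W2.T) * (1 - H ** 2)
    gW1 = Xn.T @ dH / len(Xn); gb1 = dH.mean(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1
```

The training loss (`losses`) decreases as the network learns the relationship between the characteristics and the price, which is the behaviour exploited later to measure how opaque input data degrade the forecasts.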
However, the relevance of this information must be clearly defined, critically analysed and justified in the valuation report. This practice, nevertheless, extends beyond the exceptionality indicated by the Standard. It is also crucial to emphasise that even online real estate sales sites reflect the opacity of the Italian market. For example, it is possible to compare the data available online in Italy with those on, for instance, American real estate websites. The latter display how often and how many times a property has been bought and sold, the different selling prices, the amount of property taxes, and the running costs. All this is possible thanks to the information in the real estate registers, which, in that case, is fully available online.
On the contrary, deeds of purchase are not accessible online on a large scale by Italian real estate players, and the available information is often scarce. This is also the case when querying the data of the Agenzia delle Entrate (Revenue Agency). For example, in the section "Consultation of real estate prices", economic information is available, but the information about the real estate unit sold is very limited (only the size, the cadastral category, and the date of the deed of purchase can be known). Besides, even if historical transactions of proper comparables are found, it is still not possible to ensure the veracity of the information declared in a deed. In fact, the search for comparables in Land Registries may lead to incomplete or wrong information about the transaction, the selling price or the property's characteristics.
Quantitative and qualitative characteristics, considered explanatory price variables, are an information problem when collecting sale prices. Should asking prices, therefore, be considered unsuitable data for real estate valuations? What is certain is that a valuer should be conscious of the level of opacity they bring.
In the present research, the Authors base the entire analysis on asking prices. In particular, the comparables used to train the ANNs are represented by the online ads collected on specific selling websites in Italy. Nevertheless, before debating whether the asking price may be an appropriate value to represent the (future) purchase price, it is necessary to understand how truthful the characteristics described in the advertisements are and how these, if wrongly described for various reasons, may influence the process of estimating market value. How can properties summarily described in an advertisement be considered valid comparable data? Can the consistencies, the state of maintenance, and the property characteristics be inferred with an adequate degree of reliability from commercial advertisements?
Specifically, ANNs here play the role of a multi-parametric market value assessment technique. ANNs are, in fact, used to analyse several building characteristics and forecast the corresponding market value. In the forecasting function, each independent variable (the building features) contributes differently to the estimate of the price (the dependent variable) (Simonotti, 2006). This way, it is possible to isolate the impact that every variable brings to the price and to understand what kind of error a different level of information opacity could produce for each building feature. When training the previously defined ANNcrawler on the information contained in DBcrawler, the aim is to minimise the sum of the error signals produced.
• In the third step, the Authors verify the level of transparency of the information automatically collected via the web crawler. In fact, besides the inherent inaccuracy of the use of asking prices in property valuations, it is possible to verify that selling ads also contain false statements, wrong information or incomplete data.

Databases downloading: the web crawler
For this research, it was chosen to analyse information opacity in three different Italian markets, specifically the real estate markets of:
• Bologna,
• Padova,
• Treviso.
In the Authors' opinion, these three cities are good representative case studies because they embody three different market sizes, where market size refers here to the size of the city, the number of elements constituting supply and demand, and the volume of transactions. Besides, although it is clear that every real estate market represents a specific case, none of these three cities is a strongly unique case as, for example, the markets of Venice, Rome or Milan might be. The results can therefore also be significant for other similar markets in Italy.
The first problem was understanding how to realistically collect the data and information required to develop the ANNs. As introduced in Section 3, automated crawling software was developed in the Python language to parse the real estate properties on sale in Bologna, Padova and Treviso from specific selling websites and automatically download their asking prices together with several characteristics of the buildings.
The web crawler's online search must be targeted through the definition of three different web searching domains, each specific to one city. The three domains comprise the residential properties on sale in Bologna, Padova and Treviso. Non-residential properties, such as commercial and directional premises, as well as rental premises, are excluded from this study. As far as the building typology is concerned, both new constructions and existing buildings are included, such as apartments, attics, terraced houses, and multi-/two-/single-family villas. For the first domain, all the 10 areas in Bologna are included in the downloading procedure, while the second domain contains all the 14 areas the Municipality of Padova is divided into. Finally, the 7 areas in Treviso are all comprised in the third domain.
In order to allow the web crawler to extract the information from each selling ad, it is necessary to insert each advertisement's own Uniform Resource Locator (URL), in the form of a web address like "https://...", inside the Python code. Since it clearly would have been unfeasible to manually open all the selling ads to copy and paste their web URLs into the Python code, the goal was to understand how to produce them automatically. It was possible to notice that each advertisement showed a URL given by the combination of the search-result page URL plus a serial number, where the search-result page can be defined as the list of all the selling ads resulting from the online search in the given domains. Therefore, the web crawler has been programmed to extract the URL of the search-result page first, identify the serial numbers of the advertisements, and automatically build the corresponding URLs. After that, the Python library "Beautiful Soup" was used to parse all the HTML pages of the sale advertisements. "Beautiful Soup" is a package developed by Leonard Richardson to analyse HTML documents. With the help of this library, it was possible to extract data and information from the HTML texts, creating a parse tree for all the parsed pages. A class of objects and functions was built in Python to define the set of information to be extracted from each advertisement. The class used in each domain is illustrated in Table 1.
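The two operations just described, building the ad URLs from the serial numbers and extracting fields from the parsed HTML, can be sketched as below. The Authors used Beautiful Soup; to keep this sketch dependency-free it uses the standard library `html.parser` instead, and the ad markup, URL pattern and field names are invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical ad markup: real selling websites use different tags/classes.
SAMPLE_AD = """
<div class="listing">
  <span class="price">250.000 &euro;</span>
  <span class="surface">95 m2</span>
  <span class="rooms">4</span>
</div>
"""

def build_ad_urls(result_page_url: str, serials: list[int]) -> list[str]:
    """Each ad URL = search-result page URL plus a serial number."""
    return [f"{result_page_url}/annunci/{s}" for s in serials]

class AdParser(HTMLParser):
    """Collect the text of the span classes of interest (the paper
    performs the same task by navigating the Beautiful Soup parse tree)."""
    FIELDS = {"price", "surface", "rooms"}

    def __init__(self):
        super().__init__()
        self.record, self._current = {}, None

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in self.FIELDS:
            self._current = cls

    def handle_data(self, data):
        if self._current:
            self.record[self._current] = data.strip()
            self._current = None

parser = AdParser()
parser.feed(SAMPLE_AD)
```

After feeding an ad's HTML, `parser.record` holds one observation, i.e. one row of the DBcrawler database.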
The class of objects and functions defined above has been determined according to a specific analysis of the available information contained in the property selling websites and according to the most common characteristics of the buildings used in multi-parametric market value assessment procedures (Feng and Zhu, 2017; Wang and Xu, 2018). The attributes include structural/physical characteristics, neighbourhood, and location.
After the HTML pages of the selling ads had been parsed, the data analysis library "Pandas" was applied in Python. Developed by Wes McKinney, the "Pandas" library is used here to extract an .xls file from the web crawling procedure, so as to display the data and information in the form of a table.
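A sketch of this tabulation step is given below. The column names are invented, and the sketch writes CSV rather than .xls so as not to require an Excel writer engine; the Authors' export would instead call Pandas' Excel writer (e.g. `df.to_excel`).

```python
import io

import pandas as pd

# Hypothetical crawled records; field names are illustrative only.
records = [
    {"price_eur": 250000, "surface_m2": 95, "rooms": 4, "area": "Centro"},
    {"price_eur": 180000, "surface_m2": 70, "rooms": 3, "area": "Arcella"},
]

# One DataFrame row per selling ad, one column per class element.
df = pd.DataFrame(records)

# The Authors export an Excel file; CSV is used here to keep the
# sketch free of an optional Excel-engine dependency.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
```

The resulting table, one row per observation, is the DBcrawler database that the cleaning and training steps operate on.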

Databases cleaning: removal of incomplete records and outliers
After the downloading procedure, it was necessary to clean the three databases of missing or misleading observations. First of all, incomplete ads were excluded from the training databases. Then, observations containing obvious errors or outliers were also excluded from the databases, for example, those showing a null selling price or a null floor area. Specifically, the percentages of incomplete/misleading observations are illustrated in Table 2, divided by city and class element. The percentages in Table 2 therefore represent the observations that were eliminated from the sampling because the data were not complete (or correct) in all their classes of objects. For example, the "energy class" and the "construction year" account for the highest percentages of observations lost due to missing information.
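This cleaning rule, dropping records with missing fields or null prices/areas, can be sketched as follows; the field names mirror the class elements mentioned in the text, but the exact required set is an assumption.

```python
# Class elements that every retained observation must have (assumed set).
REQUIRED = ("price_eur", "surface_m2", "energy_class", "construction_year")

def clean(records: list[dict]) -> list[dict]:
    """Drop ads with missing fields or obviously wrong values
    (null selling price or null floor area)."""
    kept = []
    for rec in records:
        if any(rec.get(k) in (None, "", 0) for k in REQUIRED):
            continue  # incomplete or null record: excluded from training
        kept.append(rec)
    return kept

raw = [
    {"price_eur": 250000, "surface_m2": 95, "energy_class": "C", "construction_year": 1995},
    {"price_eur": 0, "surface_m2": 80, "energy_class": "D", "construction_year": 1970},        # null price
    {"price_eur": 150000, "surface_m2": 60, "energy_class": None, "construction_year": 1980},  # missing class
]
cleaned = clean(raw)
```

Only the first record survives; the other two illustrate the two exclusion causes (null price, missing class element).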
As a result, the training databases downloaded via the web crawler were reduced to 1,665, 2,122 and 867 observations, respectively.
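A cleaning pass of this kind might look as follows. The column names are hypothetical, and the interquartile-range rule for outliers is an assumption for illustration; the paper only states that records with null prices or floor areas and obvious outliers were dropped.

```python
import pandas as pd

def clean_database(df):
    """Remove incomplete records and obvious errors/outliers from a
    crawled database of selling ads."""
    df = df.dropna()                                      # drop ads with any missing attribute
    df = df[(df["price"] > 0) & (df["floor_area"] > 0)]   # null price or area = obvious error
    # flag outliers on the unit price (price per square metre) with a 1.5*IQR rule
    unit = df["price"] / df["floor_area"]
    q1, q3 = unit.quantile(0.25), unit.quantile(0.75)
    iqr = q3 - q1
    return df[(unit >= q1 - 1.5 * iqr) & (unit <= q3 + 1.5 * iqr)]
```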

Training with the Cuckoo optimisation algorithm
Once the three databases (DBscrawler) had been cleaned of unusable observations, it was possible to train the three ANNscrawler. The training sets employed to train the networks were formed by randomly selecting 60% of the observations of the DBscrawler for each city. The selection sets were then defined by randomly taking another 20% of the remaining instances per city, whereas the left-over data constitute the respective testing sets. Three dataset split ratios were considered: 80%-10%-10%, 70%-15%-15% and 60%-20%-20%. The definitive split ratio was chosen depending on the number of samples and inputs present in the dataset and on the model. In this case, the databases are rather small; however, the numerous input neurons make the model more complex, so keeping an adequate number of observations in the selection and testing sets was crucial.
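A random 60%-20%-20% partition of this kind can be produced with a simple index shuffle; this sketch is an illustration of the split logic, not the authors' implementation.

```python
import numpy as np

def split_indices(n, train=0.6, selection=0.2, seed=0):
    """Randomly partition n observations into training, selection and
    testing index sets with a 60%-20%-20% ratio by default."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(n * train)
    n_sel = int(n * selection)
    return idx[:n_train], idx[n_train:n_train + n_sel], idx[n_train + n_sel:]
```

For the Bologna database of 1,665 observations, for instance, this yields 999 training, 333 selection and 333 testing observations.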
The training procedure was developed in Python and implemented separately for each city. The training sets are first used to generate different ANN models; these models are then run on the selection sets to identify the ANN that performs best on the selection set as well. The testing set is finally used to calculate the error on the forecasts. This training process is performed inside an optimisation procedure that tests the different ANN models so as to minimise the forecast error. The optimisation algorithm employed during the training of the networks is the Cuckoo optimisation algorithm (Chiroma et al., 2017; Mareli and Twala, 2018). It is a nature-inspired optimisation algorithm that identifies and compares all the ANN architectures showing a relative optimum (i.e. a local minimum of the error) and chooses the global optimum among them. The Cuckoo search is, in fact, tailored to global optimisation problems, since it employs switching parameters and balances local and global random walks.
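The Cuckoo search can be sketched as below. This is a generic, simplified variant (Lévy-flight global moves plus abandonment of a fraction `pa` of nests), not the authors' implementation; here the objective is an arbitrary function, whereas in the paper it would wrap a full train-and-evaluate cycle returning the ANN's error on the selection set.

```python
import numpy as np
from math import gamma, sin, pi

def cuckoo_search(objective, dim=2, n_nests=25, pa=0.25, iters=200, seed=0):
    """Minimal cuckoo-search sketch: each nest is a candidate solution;
    Lévy flights provide the global random walk, random nest replacement
    the local one, and only improving moves are accepted."""
    rng = np.random.default_rng(seed)
    nests = rng.uniform(-5, 5, size=(n_nests, dim))
    fitness = np.array([objective(x) for x in nests])

    # Mantegna's algorithm for Lévy-distributed step lengths (beta = 1.5)
    beta = 1.5
    sigma = (gamma(1 + beta) * sin(pi * beta / 2) /
             (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)

    for _ in range(iters):
        best = nests[np.argmin(fitness)].copy()
        # global random walk: Lévy flights biased towards the current best nest
        u = rng.normal(0, sigma, size=(n_nests, dim))
        v = rng.normal(0, 1, size=(n_nests, dim))
        step = u / np.abs(v) ** (1 / beta)
        new = nests + 0.01 * step * (nests - best)
        new_fit = np.array([objective(x) for x in new])
        improved = new_fit < fitness
        nests[improved], fitness[improved] = new[improved], new_fit[improved]
        # abandon a fraction pa of nests and rebuild them by mixing existing ones
        abandon = rng.random(n_nests) < pa
        mix = nests + rng.random((n_nests, dim)) * (
            nests[rng.permutation(n_nests)] - nests[rng.permutation(n_nests)])
        mix_fit = np.array([objective(x) for x in mix])
        replace = abandon & (mix_fit < fitness)
        nests[replace], fitness[replace] = mix[replace], mix_fit[replace]

    i = np.argmin(fitness)
    return nests[i], fitness[i]
```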
For the sake of simplicity, the neural network developed for Bologna will be indicated as ANN(B)crawler, the one for Padova as ANN(P)crawler, and the one for Treviso as ANN(T)crawler. The results of the ANNs are presented in Table 3. The mean errors produced on the testing sets are 9.50% for Bologna, 11.75% for Padova and 13.55% for Treviso.
The errors are of the same order of magnitude. However, an inverse correlation with market size can be observed: the larger the market, the lower the error, and vice versa.

Testing information transparency
This section is dedicated to using the previously developed ANNscrawler to discuss how market transparency (or rather its opacity) affects the reliability of forecasts in Italian markets. For this purpose, the three databases were re-collected by hand to test the validity of the information previously downloaded automatically via the web crawler and used to build the networks. The databases re-collected by hand are called DB(B)hand for Bologna, DB(P)hand for Padova, and DB(T)hand for Treviso.
Obviously, the same selection criteria were employed to identify the web search domain as in the previous online search. The domains are again limited to residential properties on sale (excluding rentals or other types of contract), comprising both existing buildings and new constructions.
During this second database collection, the correctness of all the information was verified, and properties whose data could not be verified were excluded from the DBshand.
Admittedly, this way of collecting data turned out to be a very long and time-consuming process. Nevertheless, it was the only way to produce a litmus test for market transparency, data correctness, and availability of information.
The three DBshand databases were then fed to the ANNscrawler, so that the correct characteristics of the properties now constitute the input neurons used to produce a new market value forecast. As the Authors expected, the error produced on the forecasts is higher than before. The mean error is 19.45% for Bologna, 25.57% for Padova, and 26.97% for Treviso. This increase in the forecast error is due to market opacity or, in other words, to the wrong information stated in the ads. The larger the real estate market, the lower the average error rate: smaller markets show less transparency in the data described by property advertisements.
The location turned out to be the major source of opacity. Only a few of the checked advertisements showed the exact location of the building, whereas most placed it in a wrong street or district. As a general trend, this problem was more accentuated in Padova and Treviso than in Bologna. Moreover, several ads had to be excluded from this analysis because they did not provide enough information (pictures and descriptions) to allow the exact position of the house to be identified. The reasons behind this high opaqueness certainly lie in the privacy requested by the owners and in the specific commercial strategy adopted by the real estate agencies: if a potential buyer is not able to find the position of the house, a brokerage becomes necessary. However, the lack of information about the location is excessive and misleading. Several ads indicated a location for the premises that turned out to be completely wrong, even confusing central, semi-central and suburban areas.

Another major source of opacity was the maintenance level, since it sometimes did not match the other information reported in the ads, such as the energy class, the construction year, or the available installations. Generally, the maintenance conditions were optimistically assessed. In some cases, the pictures clearly showed maintenance levels in much worse condition than those publicised in the advertisements. This assessment is, however, subjective, and the Authors modified the declared maintenance conditions only when they were undoubtedly different from the pictures provided.

Other errors often contained in the ads regarded the identification and definition of the building typology, the availability of a basement, the floor level, and the presence of a garage and a private/shared garden. In this regard, the online ads either lacked some of this information or provided inconsistent data. The advertisements, in fact, contained both a description and a table, and some information presented in the description was completely different from that in the table. Again, the definition of penthouses was usually vague and lacked transparency. The Italian word used in the advertisements to indicate a penthouse, i.e. "attico", should be reserved for luxury apartments placed on the top floor of a building. However, in some advertisements the term "attico" was also used for regular or cheap apartments and attics. Due to the uncertainty of this parameter, the Authors decided to flag just the "last floor" condition, with no reference to the luxury level. Other errors regarded the installations: air conditioning or mechanical ventilation systems were declared to be present in the house, when only the overall structure of pipes and ducts was prearranged for a hypothetical future installation.

There was also another set of issues highlighting information opacity that was not directly related to the characteristics of the buildings. The online ads suffered a high modification/expiration rate: many advertisements were removed, reinstated or modified within a time span of only a few weeks. This required frequent refreshes and re-downloads of the databases, and the automated web crawler developed in the frame of this research turned out to be extremely useful; however, constant manual checks also had to be performed multiple times, significantly increasing the total workload. Another issue regarded the asymmetric distribution of information. For example, the ads for new buildings were usually richer in data than those for old buildings. As a consequence, more advertisements for old buildings had to be discarded during the manual check due to the lack of essential information. For this reason, the three databases contain comparatively more advertisements for new buildings.

DEFINING VARIABLES IMPORTANCE
In the section above, the primary opacity sources have been discussed. However, not all the errors in the ads produce the same impact on the market value prediction: the opacity in certain information is much more significant than in others. For this reason, it may be helpful to determine which input parameters show the highest impact on the market value through a feature importance analysis.
Among the approaches that help calculate the variables' impact on the output are the Filter-based, Wrapper and Embedded methods (Tatwani and Kumar, 2019). Filter methods are based on univariate statistics, such as Pearson's correlation coefficient, the chi-square test, Fisher's score, the variance threshold, the dispersion ratio or the mean absolute difference. Wrapper-based approaches treat the selection of a set of features as a search problem (Ghosh et al., 2020; Suresh and Narayanan, 2019; Yassi and Moattar, 2014); examples are Forward Feature Selection, Backward Feature Elimination, Exhaustive Feature Selection and Recursive Feature Elimination. Finally, Embedded methods combine the qualities of both Filter and Wrapper methods; examples are LASSO regularisation and the Random Forest. Feature importance is assessed here using the Random Forest (RF) methodology. This approach was chosen because Embedded methods are highly accurate and show excellent generalisation properties (Siham et al., 2021). A Random Forest is a classifier formed by a set of decision trees (simple classifiers) (Ugolini, 2014), where a decision tree, in the field of computer science, is a data structure made of nodes and arcs. A decision tree is read from top to bottom. The tree's nodes are the elements that contain the information, while the arcs are the connections between the nodes. The starting node is the root and has no incoming arcs, whereas the terminal nodes, named leaves, have no outgoing arcs.
Each decision tree in a Random Forest is built (i.e. trained) on a random subset of the training set. In this case, the three training sets are the building information databases collected for Bologna, Padova and Treviso, respectively. Besides, each decision tree is also built over a random extraction of the features analysed (i.e. the building information). This randomness in selecting features and observations is a key part of constructing the classifiers, and it is meant to increase their diversity in order to decrease their correlation. To define the importance of each feature, it is necessary to measure how much each feature decreases the impurity during the training: the more a variable diminishes the impurity, the more significant that variable turns out to be. In classification (discrete variables), the impurity is given by the Gini impurity or by the information gain/reduction in entropy; in regression (continuous variables), it is given by the variance. A data matrix was defined, where the columns represent the features (the variables) and the target column contains the market value. The analysis was conducted in Python, with the "NumPy" library used to handle the data arrays for the RF regressor, which calculates the importance coefficient of each feature. 70% of the observations were employed as the training set, whereas the remaining 30% were used as the testing set. During the RF procedure, 2,000 trees were built, and the threshold was set to 0.75 of the mean value of the importance coefficients. The decrease in impurity for each feature is assessed as the average of the decreases given by each tree constituting the forest; in this way, the final importance of each variable is estimated. The importance coefficients calculated by the regressor are shown in Table 4.
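A procedure with these settings can be sketched as below. The use of scikit-learn's `RandomForestRegressor` is an assumption (the text mentions only NumPy), and the feature names are hypothetical; the 70/30 split, 2,000 trees and the 0.75-of-mean threshold mirror the settings reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def rank_features(X, y, names, n_trees=2000, threshold_ratio=0.75, seed=0):
    """Fit an RF regressor on a 70/30 split and keep the features whose
    mean-decrease-in-impurity importance exceeds threshold_ratio times
    the mean importance."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=seed)
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    rf.fit(X_tr, y_tr)
    imp = rf.feature_importances_  # average impurity (variance) decrease per feature
    keep = imp >= threshold_ratio * imp.mean()
    return dict(zip(names, imp)), [n for n, k in zip(names, keep) if k]
```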
If a piece of wrong information in the ads regards the most impactful data, such as latitude and longitude (i.e. location), maintenance level, or floor area, a considerable error will be made in the market value forecast. On the contrary, the least impactful variables are the installations and technologies, such as air conditioning, mechanical ventilation, alarm, lift, or building automation. Among the least impactful variables are also the basement, the shared garden and the floor level.

DISCUSSION AND CONCLUSIONS
This work has integrated real estate market analysis with multi-parametric market value assessment techniques, computer programming and machine learning procedures to investigate market opacity in Italy. First, an automated web crawler developed in Python made it possible to rapidly collect a considerable number of observations describing the properties on sale in Bologna, Padova and Treviso. Based on these three databases, three corresponding Artificial Neural Networks were trained in Python to forecast the market value of a property as a function of 32 input characteristics, including, among others, location, maintenance level, installations and technologies, building typology, terrace, garage, and garden. Then, the three databases were re-collected a second time by hand. This procedure made it possible to check every piece of information stated in the ads: wrong information was corrected, and non-verifiable observations were excluded. In addition, this multi-parametric analysis made it possible to identify the variables with the greatest impact on the estimate. This is crucial, as those building characteristics must be checked carefully to assess the market value properly. Higher opacity in the most impactful variables will therefore lead to a higher forecast error, whereas lower transparency in the less impactful variables produces only a minor error in the estimate.
At the end of this research, the Authors identify the need to improve the sharing of data and information about real estate properties in Italy. Historical transaction prices should be made available, and the descriptions of the assets should be much more precise and complete. Moreover, selling ads could be required to contain a minimum level of data before being published online. Above all, the information provided in the ads must be accurate and correct, especially regarding the property's location.
Finally, selling ads should be more transparent; different websites could even adopt a minimum shared layout to ensure completeness and clarity.
The purpose is twofold: first, to reduce the information asymmetry between seller and buyer, so that the demand side can move more consciously in the real estate market; secondly, to increase transparency in the market, since these data are used both by companies that analyse the market, produce reports and publish prices, and by valuers, who sometimes have to use asking prices because transaction prices cannot be found. Moreover, since the real estate market has become complex and has substantial implications for the rest of the economy, all operators must be assured of the quality of the information collected.
The debate analysed here revolves around the concept of "quality of information". Of course, market investigation cannot disregard the in-depth analysis of every comparable, whether a purchase, a sale, or an offer. However, it also seems constructive to focus on the procedures rather than on the type of data: appropriate methodologies allow even spurious information to be approached professionally and constructively, drawing meaningful insights.
In further developments of this research, the Authors intend to periodically apply the methodology to other Italian real estate markets, in order to map the different levels of opacity and to understand whether access to information evolves over time. In particular, it will be interesting to see whether the dynamics of recent years (the Covid-19 pandemic, the war in Ukraine, the energy crisis, inflation) may in some way impact not only real estate dynamics and prices, but also the level of transparency or opacity.
The price is the effective figure at which a property has been sold in an open market; it is therefore historical data that can only be observed once the transaction has occurred. Conversely, value is a prior estimate of a price or, in other words, an estimate of the most likely figure that would be paid if a property were sold in an open market on the date of the valuation. Prices are historical facts, while values are hypothetical estimates of prices. Therefore, prices and values are profoundly different concepts, occurring at different negotiation times and embodying different roles. The difference between prices and values is called valuation accuracy. As, again, well described in (French et al., 2021), valuation accuracy is where the market valuation (the estimate) differs from the actual price achieved; it therefore represents how reliable the value assessment has been. This leads to another important concept, namely valuation variability: the difference between two (or more) valuations performed by different professionals on the same property, at the same time, in the same market. Thus, we speak of valuation accuracy when market valuations differ from the actual sale prices (adjusted for time), and of valuation variability when one valuation of the same asset differs from another. Valuation accuracy and valuation variability both indicate, to a certain extent, the "error" made in estimating the value of a property compared to the actual sale price. A value assessment is only an estimate of a price based on the information available when the valuation is made; it is not an exact calculation of a figure. No mathematical formula or sophisticated econometric algorithm indicates a property's precise value. It is challenging to obtain highly reliable assessments of market values, since valuations are always characterised by uncertainty, which the valuer's skill, intuition, competence and experience cannot eliminate.
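The two concepts can be made concrete with simple percentage measures. These formulas are an illustration of the definitions above, not a standard adopted in the paper: accuracy is expressed as the gap between one valuation and the achieved price, variability as the spread among several valuations of the same asset.

```python
def valuation_accuracy(valuation, sale_price):
    """Percentage gap between a market valuation and the (time-adjusted)
    price actually achieved in the transaction."""
    return abs(valuation - sale_price) / sale_price * 100

def valuation_variability(valuations):
    """Spread between valuations of the same asset by different valuers,
    expressed as the range relative to the mean valuation."""
    mean = sum(valuations) / len(valuations)
    return (max(valuations) - min(valuations)) / mean * 100
```

For example, a valuation of 190,000 against a sale price of 200,000 yields an accuracy gap of 5%, while three valuations of 190,000, 200,000 and 210,000 for the same asset yield a variability of 10%.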

Table 1 - Class of objects and functions

Table 2 - Percentage of lost advertisements per class

Table 4 - RF importance coefficients

These three "cleaned" databases have been implemented on the Artificial Neural Networks developed before: the error produced on the forecasts represents the error in the estimate of the market value due to market opacity (or, in other words, due to the wrong information contained in the ads). Then, a feature importance analysis was performed based on the Random Forest methodology. As a result, this research can help understand how, and how much, market opacity in Italy affects the reliability of property valuations. Artificial Neural Networks are adequate forecasting statistical procedures, and neural network models accurately describe any input-output relationship. Using a neural network model to compare the results of an opaque database versus a "clean" database has made it possible to determine how much a valuer misses the best estimate due to a lack of market transparency.

Nomenclature
- output neuron from DBhand
- ANN(B)crawler - ANN developed on DBcrawler for Bologna
- ANN(P)crawler - ANN developed on DBcrawler for Padova
- ANN(T)crawler - ANN developed on DBcrawler for Treviso
- ANN/ANNs - Artificial Neural Network/Artificial Neural Networks
- ANNcrawler - Artificial Neural Network developed on the basis of the DBcrawler
- bz - bias function in a neuron
- DB(B)crawler - database collected via the web crawler for Bologna
- DB(B)hand - database collected by hand for Bologna
- DB(P)crawler - database collected via the web crawler for Padova
- DB(P)hand - database collected by hand for Padova
- DB(T)crawler - database collected via the web crawler for Treviso
- DB(T)hand - database collected by hand for Treviso
- DBcrawler - database collected via the web crawler
- DBhand - database collected by hand
- RF - Random Forest
- U - number of artificial synapses entering a neuron
- wz,u - weight function in a neuron
- xz,u - numerical inputs in a neuron
- Yz - numerical output in a neuron
- z - activation function in a neuron

Abstract. To this end, transparency is investigated with the help of machine learning techniques, and a set of Artificial Neural Networks (ANNs) is developed with the aim of identifying the main sources of market opacity. In particular, the ANNs are employed as a multi-parametric forecasting tool capable of estimating the market value of a property as a function of the various characteristics describing each building. The ANNs are trained on databases obtained from specific commercial property-listing websites available online, thanks to an automated downloading process developed ad hoc in the Python® programming language. These forecasting models reflect "market opacity" because they are trained on asking-price data which, as illustrated below, by their nature contain incomplete, misleading or even erroneous information. Subsequently, the databases were checked item by item, in order to verify the correctness of all the information reported in the selling ads, thus producing more transparent databases. The ANNs are then implemented on the corrected databases, making it possible to quantify how much unreliable information affects the assessment of market value.