<< Introduction >>
The idea was to create a tool for analysing the Serbian stock
market. The daily stock exchange reports were published as plain
HTML, packed with poorly organized tables that held transaction
data and company characteristics, so the data could not be
processed automatically. The official stock exchange site offered
no data analysis tools. Users had to sort and filter the data
manually, make the calculations and then interpret the results
themselves.
The agency is designed to collect and analyse data from the
official stock market web site. It is meant to take over the
manual work of managers, brokers and statistical analysts. It is
completely automated, and for that reason it is based on an agent
framework. Agents are in charge of the operational level: they
collect, analyse and store data for further use. The end user can
access the agency over the internet and in that way use an on-line
tool for data analysis.
Picture 1 shows the parts of the system. The communication and
work of the agents are based on the JADE environment.
Picture 1: Overview of the agency system
<< Process of collecting and analysing documents >>
The process of collecting and analysing documents is conducted in
four phases:
- HTML parsing and table extraction
- Transforming HTML into objects
- Searching for patterns in the tables
- Using table patterns to extract data
The process is illustrated below (Picture 2):
Picture 2: Process of collecting and analysing documents
Two types of agents are involved in this process: Parsers and
Analysers. A Parser is responsible for the first and second
phases, and an Analyser is responsible for the other two. All
agents must have their services registered by the Coordinator
agent.
1. HTML parsing and table extraction
The Coordinator agent clones a number of Parsers. The Parsers
then download the pages and extract the tables.
The input to this phase is the HTML code of a page, and the output
is an array of tables.
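As an illustration of this phase, here is a minimal sketch in Python rather than the agency's actual JADE code. It assumes the page has already been downloaded into a string. The names `TableExtractor` and `extract_tables` are hypothetical. The sketch uses the standard `html.parser` module to collect the markup of every top-level table into a list.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects the markup of every top-level <table> in a page (hypothetical helper)."""

    def __init__(self):
        # Keep entities such as &amp; unconverted so the fragments stay valid HTML.
        super().__init__(convert_charrefs=False)
        self.tables = []   # finished table fragments, in page order
        self._depth = 0    # how many <table> tags are currently open
        self._parts = []   # pieces of the table being collected

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
        if self._depth:
            self._parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self._depth:
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self._depth:
            return
        self._parts.append(f"</{tag}>")
        if tag == "table":
            self._depth -= 1
            if self._depth == 0:
                # The outermost table is closed: store it and start afresh.
                self.tables.append("".join(self._parts))
                self._parts = []

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)

    def handle_entityref(self, name):
        if self._depth:
            self._parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._depth:
            self._parts.append(f"&#{name};")


def extract_tables(html):
    """Return a list with the HTML source of each top-level table."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables
```

In this sketch a nested table stays inside the fragment of its outermost table, which is only one possible way to handle poorly organized report pages.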
2. Transforming HTML into objects
In this phase an object is created for each table extracted in
the previous step.
The input to this phase is an HTML table, and the output is an
object in memory.
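The source does not describe the structure of the in-memory object, so the following Python sketch is only one possible shape. It assumes that a table can be represented as a list of rows of cell strings and that the first row is the header. The names `Table` and `table_from_html` are hypothetical.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Table:
    """In-memory form of one HTML table (assumed structure)."""
    rows: list = field(default_factory=list)  # each row is a list of cell strings

    @property
    def header(self):
        # Assumption: the first row holds the column names.
        return self.rows[0] if self.rows else []

    @property
    def body(self):
        return self.rows[1:]


class _TableBuilder(HTMLParser):
    """Fills a Table from the <tr>, <td> and <th> tags of a fragment."""

    def __init__(self):
        super().__init__()
        self.table = Table()
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.table.rows.append(self._row)
            self._row = None
            self._cell = None


def table_from_html(fragment):
    """Build a Table object from the HTML source of one table."""
    builder = _TableBuilder()
    builder.feed(fragment)
    builder.close()
    return builder.table
```

Keeping only the cell text discards attributes and styling, which is enough when later phases compare cell contents and positions.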
3. Searching for patterns in the tables
The Coordinator agent clones a number of Analysers. Each Analyser
searches the pattern database. If it finds a previously stored
pattern whose characteristics are the most similar to those of the
table currently being analysed, the data extraction method
associated with that pattern is used to extract data from the
current table. If no such pattern has been stored, an agent called
Plowher comes into play. Its purpose is to find patterns in the
data contained in the tables. It first loads a sample of, for
example, 100 tables. It then looks for similar table cells that
are positioned at the same places in all tables. If it can
determine that the same structures exist in all tables, a pattern
has been found. The pattern is stored as XML together with the
data extraction rules for the associated table. If no pattern can
be found, the task of defining one is left to the administrator of
the system. A sample picture of the interface is given below
(Picture 3):
Picture 3: Administrator tool for extraction rule making
The input to this phase is the table object in memory, and the
output is the XML pattern with rules for data extraction.
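The cell comparison described above can be sketched as follows. This is a simplified illustration in Python, not the Plowher agent itself. It treats two cells as similar only when their text is identical. Each table is assumed to be a list of rows of cell strings. The function names and the XML layout of the stored pattern are hypothetical.

```python
import xml.etree.ElementTree as ET


def find_pattern(tables):
    """Return {(row, col): text} for cells that are identical in every table."""
    if not tables:
        return {}
    first, rest = tables[0], tables[1:]
    pattern = {}
    for r, row in enumerate(first):
        for c, text in enumerate(row):
            # The cell belongs to the pattern only if every other table
            # has the same text at the same position.
            if all(r < len(t) and c < len(t[r]) and t[r][c] == text for t in rest):
                pattern[(r, c)] = text
    return pattern


def pattern_to_xml(name, pattern):
    """Serialize a pattern to an XML string (assumed layout)."""
    root = ET.Element("pattern", name=name)
    for (r, c), text in sorted(pattern.items()):
        cell = ET.SubElement(root, "cell", row=str(r), col=str(c))
        cell.text = text
    return ET.tostring(root, encoding="unicode")
```

With a sample of tables that share a header row, only the header cells survive the comparison, so they become the stored pattern. An empty result corresponds to the case in which the administrator has to define the rules by hand.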
4. Using table patterns to extract data
Each pattern stored in the database is used to extract data from
the tables, based on the rules that it contains. When the table
currently being analysed has a pattern similar to one from the
database, a number representing the percentage of confirmation is
calculated using the Bernoulli criterion.
The input to this phase is an unknown table, and the output is an
XML document with the data from the table.
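As a rough illustration of the extraction step, the Python sketch below assumes the simplest possible rule: the first row of the table is the header and every other row becomes one `instance` element. The naming rule in `element_name` is inferred from the sample output in the Example section and is an assumption, as are the function names. The confirmation percentage is not covered, because the source does not give its formula.

```python
import re
import xml.etree.ElementTree as ET


def element_name(header):
    """Turn a header cell into an element name (rule inferred, not documented)."""
    # Spaces and opening parentheses become underscores ...
    name = re.sub(r"[ (]", "_", header.strip())
    # ... and every other character that is not a letter, digit or underscore is dropped.
    return re.sub(r"[^\w]", "", name)


def extract_to_xml(rows, root_name):
    """Build an XML tree with one <instance> per data row."""
    header, body = rows[0], rows[1:]
    names = [element_name(h) for h in header]
    root = ET.Element(root_name)
    for row in body:
        instance = ET.SubElement(root, "instance")
        for name, value in zip(names, row):
            ET.SubElement(instance, name).text = value
    return root
```

Calling `ET.tostring(extract_to_xml(rows, "zemljiste"), encoding="unicode")` would then give the XML text for a table passed in as a list of rows.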
<< Example >>
At the beginning of the process we have a table like this:
| Naziv i vrsta zemljišta-objekta | Namena zemljišta-objekta | Površina zemljišta-objekta (m2) |
| Građevinsko zemljište | Za građevinske objekte | 8.380 |
| Proizvodna hala | Proizvodnja | 3.000 |
| Upravna zgrada | Kanc. prostor I pomoćne prostorije | 1.400 |
| Garaža | Smeštaj vozila | 350 |
| Magacin gotove robe | Smeštaj gotovih proizvoda | 1.980 |
:: Sample table with data written in the Serbian language ::
At the end of the process we have XML like this:
<XML xmlns="http://tempuri.org/Header_zemljiste.xsd">
  <zemljiste>
    <instance>
      <Naziv_i_vrsta_zemljištaobjekta>Građevinsko zemljište</Naziv_i_vrsta_zemljištaobjekta>
      <Namena_zemljištaobjekta>Za građevinske objekte</Namena_zemljištaobjekta>
      <Površina_zemljištaobjekta__m2>8.380</Površina_zemljištaobjekta__m2>
    </instance>
    <instance>
      <Naziv_i_vrsta_zemljištaobjekta>Proizvodna hala</Naziv_i_vrsta_zemljištaobjekta>
      <Namena_zemljištaobjekta>Proizvodnja</Namena_zemljištaobjekta>
      <Površina_zemljištaobjekta__m2>3.000</Površina_zemljištaobjekta__m2>
    </instance>
    <instance>
      <Naziv_i_vrsta_zemljištaobjekta>Upravna zgrada</Naziv_i_vrsta_zemljištaobjekta>
      <Namena_zemljištaobjekta>Kanc. prostor I pomoćne prostorije</Namena_zemljištaobjekta>
      <Površina_zemljištaobjekta__m2>1.400</Površina_zemljištaobjekta__m2>
    </instance>
    <instance>
      <Naziv_i_vrsta_zemljištaobjekta>Garaža</Naziv_i_vrsta_zemljištaobjekta>
      <Namena_zemljištaobjekta>Smeštaj vozila</Namena_zemljištaobjekta>
      <Površina_zemljištaobjekta__m2>350</Površina_zemljištaobjekta__m2>
    </instance>
    <instance>
      <Naziv_i_vrsta_zemljištaobjekta>Magacin gotove robe</Naziv_i_vrsta_zemljištaobjekta>
      <Namena_zemljištaobjekta>Smeštaj gotovih proizvoda</Namena_zemljištaobjekta>
      <Površina_zemljištaobjekta__m2>1.980</Površina_zemljištaobjekta__m2>
    </instance>
  </zemljiste>
</XML>