Extract the data you need from Internet!

Imagine you have found a really interesting website with information that you need for your project.
However, that website does not provide a data feed, or its content is not in a well-structured format suitable for automatic processing. How will you extract the data?

This is the case with the CIA World Factbook website, a great public dataset of countries with information such as population, area and number of mobile phones. The data on the website is semi-structured, so you cannot import it directly into Microsoft Excel or a MySQL database. Importing it into a database would allow you to run queries such as "show me the list of countries with a GDP per capita greater than $20,000 and more than 50 million cell phones", or to apply statistics and data-mining techniques. See here what this dataset looks like once imported.
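
To make the example query concrete, here is a minimal sketch of what it could look like once the data sits in a relational table. It uses Python's built-in sqlite3 module rather than MySQL. The table name, the column names and the sample rows are assumptions made up for illustration; the figures are placeholders, not real Factbook values.

```python
import sqlite3

# Placeholder rows for illustration only; these are NOT real Factbook figures.
# Columns: name, GDP per capita in USD, number of mobile phones.
SAMPLE_ROWS = [
    ("Country A", 35000, 80_000_000),
    ("Country B", 42000, 12_000_000),
    ("Country C", 6000, 300_000_000),
    ("Country D", 28000, 55_000_000),
]


def query_countries(rows, min_gdp_per_capita=20_000, min_mobile_phones=50_000_000):
    """Load rows into an in-memory table and return the matching country names."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical schema; a real import would have many more columns.
        conn.execute(
            "CREATE TABLE countries ("
            "name TEXT, gdp_per_capita_usd REAL, mobile_phones INTEGER)"
        )
        conn.executemany("INSERT INTO countries VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT name FROM countries "
            "WHERE gdp_per_capita_usd > ? AND mobile_phones > ? "
            "ORDER BY name",
            (min_gdp_per_capita, min_mobile_phones),
        )
        return [name for (name,) in cursor]
    finally:
        conn.close()


if __name__ == "__main__":
    print(query_countries(SAMPLE_ROWS))
```

The thresholds are passed as query parameters, so the same function can answer other variants of the question without editing the SQL.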

Extracting databases from websites is extremely useful for:

* creating smart applications
* trends and business analysis
* monitoring business competitors
* direct marketing
* getting datasets for scientific experiments

In this workshop we will explain how to extract data from this type of website.
We will first review existing and related technologies (Dapper, Web-Harvest, OpenKapow, Yahoo Pipes...)
and then introduce our open-source web information extraction toolkit.
Target audience: software developers.
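
As a rough idea of what extracting data from a semi-structured page involves, here is a minimal sketch using only Python's standard html.parser module. It is not the toolkit presented in the workshop, nor any of the tools listed above. The HTML snippet, the "label" and "value" class names and the FieldExtractor class are all hypothetical and only stand in for a real page layout.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; real pages are messier and use other markup.
SAMPLE_PAGE = """
<html><body>
  <h1>Country A</h1>
  <div class="field"><span class="label">Population:</span>
    <span class="value">1,234,567</span></div>
  <div class="field"><span class="label">Area:</span>
    <span class="value">45,678 sq km</span></div>
</body></html>
"""


class FieldExtractor(HTMLParser):
    """Collect label/value pairs from <span class="label"> and <span class="value">."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # "label" or "value" while inside such a span
        self._label = None    # last label seen, waiting for its value

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("label", "value"):
            self._current = css_class

    def handle_endtag(self, tag):
        if tag == "span":
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._current is None:
            return  # ignore whitespace and text outside the spans of interest
        if self._current == "label":
            self._label = text.rstrip(":")
        elif self._label is not None:
            self.fields[self._label] = text
            self._label = None


def extract_fields(html):
    """Return a dict mapping each label to its raw text value."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return parser.fields


if __name__ == "__main__":
    print(extract_fields(SAMPLE_PAGE))
```

The values come back as raw strings; turning "1,234,567" into a number or splitting off units such as "sq km" would be a separate cleaning step before loading the data into a database.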

For a higher-level picture, and to discuss the legal aspects of extracting and using data from websites,
you may want to attend this other workshop.

Contact: Charles François Rey <charlesfr.rey@db4all.com>
DB4ALL.com

PS: If you have questions, comments or suggestions about this workshop, do not hesitate to contact us prior to the workshop.


Preferred time: 
16:00
20 individuals signed up
Chris Hofmann
Jérôme De Vries
Kadir Topal
Hélène De Ribaupierre
Manuel Donze
Gregory Barbezat
Guerdat Yannick
Raphaël Briner
Patrick OLLIVIER
Justine Andrieu
Sarah Wade Hutman
Hannes Gassert
Stephan Baumann
Thierry Chauvin
Nicolas Aguttes
Léonard Studer
Quentin Bonnard
Richard Chappuis
Marc Cathomen
Ralf Bickel
Room: 
12

Comments

Hello,

Thanks for registering for the workshop "Extract the data you need from Internet!".

We are asking participants to tell us a bit about what they expect from this workshop,
so that we can better target it.

I see two main options:
1- You bring your laptop, we give you a tutorial and you start programming using Webminer.
We are there to answer your questions.

2- We explain the concepts around Webminer using slides and some examples.

Please let us know your preferences.

Best regards,
David Portabella

PS: Please send your reply to david.portabella@gmail.com.
(Do not reply to info@liftconference.com; otherwise the LIFT moderators have to forward your email to me.)


Hello,

Thank you for your answers.

Some people prefer to bring their laptop and develop during the workshop.
If you are one of them, please install the Java SDK (1.5 or 1.6) and Ant in advance.
http://ant.apache.org/

I'll present some slides and lead a discussion for those who prefer a higher-level picture.

However, please note that this workshop is mainly targeted at software developers.
For a higher-level picture, you may want to attend this other workshop:
http://liftconference.com/get-data-your-commercial-project

Best regards,
David Portabella
http://db4all.com

