Automated Data Collection with R

eBook - A Practical Guide to Web Scraping and Text Mining

Rubba, Christian/Meißner, Peter/Munzert, Simon et al

WILEY

Mathematics/Probability Theory, Stochastics, Mathematical Statistics

Published on 24 October 2014

€60.99

(incl. VAT)

Bibliographic Data

ISBN/EAN: 9781118834787

Language: English

Length: 480 pp., 8.25 MB

Edition: 1st edition, 2014

E-Book
Format: PDF
DRM: Adobe DRM

Description

A hands-on guide to web scraping and text mining for both beginners and experienced users of R

Introduces fundamental concepts of the main architecture of the web and databases and covers HTTP, HTML, XML, JSON, and SQL.

Provides basic techniques to query web documents and data sets (XPath and regular expressions).

An extensive set of exercises is presented to guide the reader through each technique.

Explores both supervised and unsupervised techniques as well as advanced techniques such as data scraping and text management.

Case studies are featured throughout along with examples for each technique presented.

R code and solutions to exercises featured in the book are provided on a supporting website.

About the Authors

Simon Munzert is the author of Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining, published by Wiley.

Christian Rubba is the author of Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining, published by Wiley.

Peter Meißner is the author of Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining, published by Wiley.

Dominic Nyhuis is the author of Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining, published by Wiley.

Contents

Preface xv

1 Introduction 1

1.1 Case study: World Heritage Sites in Danger 1

1.2 Some remarks on web data quality 7

1.3 Technologies for disseminating, extracting, and storing web data 9

1.4 Structure of the book 13

Part One A Primer on Web and Data Technologies 15

2 HTML 17

2.1 Browser presentation and source code 18

2.2 Syntax rules 19

2.3 Tags and attributes 24

2.4 Parsing 32

3 XML and JSON 41

3.1 A short example XML document 42

3.2 XML syntax rules 43

3.3 When is an XML document well formed or valid? 51

3.4 XML extensions and technologies 53

3.5 XML and R in practice 60

3.6 A short example JSON document 68

3.7 JSON syntax rules 69

3.8 JSON and R in practice 71

4 XPath 79

4.1 XPath--a query language for web documents 80

4.2 Identifying node sets with XPath 81

4.3 Extracting node elements 93

5 HTTP 101

5.1 HTTP fundamentals 102

5.2 Advanced features of HTTP 116

5.3 Protocols beyond HTTP 124

5.4 HTTP in action 126

6 AJAX 149

6.1 JavaScript 150

6.2 XHR 154

6.3 Exploring AJAX with Web Developer Tools 158

7 SQL and relational databases 164

7.1 Overview and terminology 165

7.2 Relational Databases 167

7.3 SQL: a language to communicate with Databases 175

7.4 Databases in action 188

8 Regular expressions and essential string functions 196

8.1 Regular expressions 198

8.2 String processing 207

8.3 A word on character encodings 214

Part Two A Practical Toolbox for Web Scraping and Text Mining 219

9 Scraping the Web 221

9.1 Retrieval scenarios 222

9.2 Extraction strategies 270

9.3 Web scraping: Good practice 278

9.4 Valuable sources of inspiration 290

10 Statistical text processing 295

10.1 The running example: Classifying press releases of the British government 296

10.2 Processing textual data 298

10.3 Supervised learning techniques 307

10.4 Unsupervised learning techniques 313

11 Managing data projects 322

11.1 Interacting with the file system 322

11.2 Processing multiple documents/links 323

11.3 Organizing scraping procedures 328

11.4 Executing R scripts on a regular basis 334

Part Three A Bag of Case Studies 341

12 Collaboration networks in the US Senate 343

12.1 Information on the bills 344

12.2 Information on the senators 350

12.3 Analyzing the network structure 353

12.4 Conclusion 358

13 Parsing information from semistructured documents 359

13.1 Downloading data from the FTP server 360

13.2 Parsing semistructured text data 361

13.3 Visualizing station and temperature data 368

14 Predicting the 2014 Academy Awards using Twitter 371

15 Mapping the geographic distribution of names 380

15.1 Developing a data collection strategy 381

15.2 Website inspection 382

15.3 Data retrieval and information extraction 384

15.4 Mapping names 387

15.5 Automating the process 389

16 Gathering data on mobile phones 396

16.1 Page exploration 396

16.2 Scraping procedure 404

16.3 Graphical analysis 406

16.4 Data storage 408

17 Analyzing sentiments of product reviews 416

17.1 Introduction 416

17.2 Collecting the data 417

17.3 Analyzing the data 426

17.4 Conclusion 434

References 435

General index 442

Package index 448

Function index 449

Information on E-Books

Congratulations on purchasing an e-book from BUCHBOX! Here is some practical information.

Adobe ID

If you have purchased e-books with copy protection (DRM), you always need an Adobe ID. Simply click here and enter your name, email address, and a password of your choice. The combination of email address and password is your Adobe ID. Please note it down carefully.

Warning: If you download copy-protected e-books WITHOUT creating an Adobe ID, you will never be able to read them on any device other than your PC!

Forgotten the password for your Adobe ID? You can request a new one HERE.

Adobe Digital Editions help page

Reading on a tablet or phone

If you want to read on your tablet, use a suitable app.

For iPad, iPhone, etc., get the Bluefire reading app from the iTunes Store.

For Android devices (e.g. Samsung), you can get the Bluefire reading app from the Google Play Store (or alternatively: Aldiko).

Reading on an e-book reader or on a PC / Mac

To download the files to your PC and transfer them to your e-book reader, use the software ADE (Adobe Digital Editions).

Go directly to the downloads here

Other devices / software

Amazon Kindle. We do NOT recommend these devices.

EPUB files with Adobe DRM cannot be read on an Amazon Kindle. Neither the EPUB file format nor the Adobe DRM copy protection is compatible with the Kindle. Conversely, all e-books purchased from Amazon can only be read on Amazon's device. Reading devices such as the Tolino, by contrast, are completely open: you can buy e-books for the Tolino online from many thousands of bookshops. For example, here with us.

Software for Sony e-book readers

If you have a Sony Reader, you can find the additional Sony software here.

Computer/laptop with Unix or Linux

The Adobe Digital Editions software is not compatible with Unix or Linux. With a WINE virtualization, however, you can still access your e-books.
