You may have to register before you can download all our books and magazines, click the sign up button below to create a free account.
What is bad data? Some people consider it a technical phenomenon, like missing values or malformed records, but bad data includes a lot more. In this handbook, data expert Q. Ethan McCallum has gathered 19 colleagues from every corner of the data arena to reveal how they’ve recovered from nasty data problems. From cranky storage to poor representation to misguided policy, there are many paths to bad data. Bottom line? Bad data is data that gets in the way. This book explains effective ways to get around it. Among the many topics covered, you’ll discover how to: Test drive your data to see if it’s ready for analysis Work spreadsheet data into a usable form Handle encoding problems that lurk in text data Develop a successful web-scraping effort Use NLP tools to reveal the real sentiment of online reviews Address cloud computing issues that can impact your analysis effort Avoid policies that create data analysis roadblocks Take a systematic approach to data quality analysis
Managing multiple Red Hat-based systems can be easy--with the right tools. The yum package manager and the Kickstart installation utility are full of power and potential for automatic installation, customization, and updates. Here's what you need to know to take control of your systems.
R is a wonderful thing, indeed: in recent years this free, open-source product has become a popular toolkit for statistical analysis and programming. Two of R's limitations -- that it is single-threaded and memory-bound -- become especially troublesome in the current era of large-scale data analysis. It's possible to break past these boundaries by putting R on the parallel path. Parallel R will describe how to give R parallel muscle. Coverage will include stalwarts such as snow and multicore, and also newer techniques such as Hadoop and Amazon's cloud computing platform.
It’s tough to argue with R as a high-quality, cross-platform, open source statistical software product—unless you’re in the business of crunching Big Data. This concise book introduces you to several strategies for using R to analyze large datasets, including three chapters on using R and Hadoop together. You’ll learn the basics of Snow, Multicore, Parallel, Segue, RHIPE, and Hadoop Streaming, including how to find them, how to use them, when they work well, and when they don’t. With these packages, you can overcome R’s single-threaded nature by spreading work across multiple CPUs, or offloading work to multiple machines to address R’s memory barrier. Snow: works well in a traditional cluster environment Multicore: popular for multiprocessor and multicore computers Parallel: part of the upcoming R 2.14.0 release R+Hadoop: provides low-level access to a popular form of cluster computing RHIPE: uses Hadoop’s power with R’s language and interactive shell Segue: lets you use Elastic MapReduce as a backend for lapply-style operations
You're sitting on a pile of interesting data. How do you transform that into money? It's easy to focus on the contents of the data itself, and to succumb to the (rather unimaginative) idea of simply collecting and reselling it in raw form. While that's certainly profitable right now, you'd do well to explore other opportunities if you expect to be in the data business long-term. In this paper, we'll share a framework we developed around monetizing data. We'll show you how to think beyond pure collection and storage, to move up the value chain and consider longer-term opportunities.
Create and publish your own interactive data visualization projects on the webâ??even if you have little or no experience with data visualization or web development. Itâ??s inspiring and fun with this friendly, accessible, and practical hands-on introduction. This fully updated and expanded second edition takes you through the fundamental concepts and methods of D3, the most powerful JavaScript library for expressing data visually in a web browser. Ideal for designers with no coding experience, reporters exploring data journalism, and anyone who wants to visualize and share data, this step-by-step guide will also help you expand your web programming skills by teaching you the basics of HTM...
Annotation IT Managers, developers, data analysts, system architects, and similar technical workers are now encountering the largest and most disruptive change in their profession since the ascendancy of the relational database in early 1980s. You hear that NoSQL and Big Data Analytics are about to replace the systems and skills you now own and possess, but there's often no easy way to make that transition. To exacerbate the issue, the transition may not be gradual, but forced on you by a new project in your enterprisenamely, Hadoopthat will immediately require new ways of thinking, new tools, and new techniques. This book helps you understand the components of the Hadoop ecosystem and how they relate to each other. You'll discover how to get started on that project in an efficient manner that lays out the possibilities. The authors suggest a path and resources that will guide you on their journey from the status quo to the Brave New World you face.
If you're considering R for statistical computing and data visualization, this book provides a quick and practical guide to just about everything you can do with the open source R language and software environment. You'll learn how to write R functions and use R packages to help you prepare, visualize, and analyze data. Author Joseph Adler illustrates each process with a wealth of examples from medicine, business, and sports. Updated for R 2.14 and 2.15, this second edition includes new and expanded chapters on R performance, the ggplot2 data visualization package, and parallel R computing with Hadoop. Get started quickly with an R tutorial and hundreds of examples Explore R syntax, objects, and other language details Find thousands of user-contributed R packages online, including Bioconductor Learn how to use R to prepare data for analysis Visualize your data with R's graphics, lattice, and ggplot2 packages Use R to calculate statistical fests, fit models, and compute probability distributions Speed up intensive computations by writing parallel R programs for Hadoop Get a complete desktop reference to R.
Advanced R helps you understand how R works at a fundamental level. It is designed for R programmers who want to deepen their understanding of the language, and programmers experienced in other languages who want to understand what makes R different and special. This book will teach you the foundations of R; three fundamental programming paradigms (functional, object-oriented, and metaprogramming); and powerful techniques for debugging and optimising your code. By reading this book, you will learn: The difference between an object and its name, and why the distinction is important The important vector data structures, how they fit together, and how you can pull them apart using subsetting Th...