CONTENT IS KING. SCRAPING IS THE TROJAN HORSE.

We at we45 work with a lot of companies that develop next-generation web and mobile apps. The trend that has emerged is a no-brainer: in the modern web, content is the undisputed king. Any site, from eCommerce to eLearning, depends on great content to deliver great results to its customers and investors. If content is irrelevant, slow or even marginally off balance, the app is canned faster than yesterday’s garbage. Security is important, sure. But there is one type of information disclosure attack that most companies do not give much thought to. Let me illustrate with a story.

We have been working with one of the largest travel portals on the planet. During an initial penetration test, we identified certain cached URLs (on search engines) that exposed customer booking details: the booking ID and the customer’s personal data. It was bad. But we took it to the next level. Using an incrementing series of booking IDs (the parameter was vulnerable to an Insecure Direct Object Reference attack), we wrote a custom web scraper that was able to extract the personal data of users. Soon, we had 3 million customers’ booking information and had just performed a massive information disclosure attack.

Now this is where it gets interesting. When we reported this finding to our customer, their immediate reaction was to disable caching so that search engines couldn’t index this information. However, there was another major area that they had completely ignored. The fact that we were able to build and scale a simple web scraper in a matter of hours completely eluded them, and completely confounded me. Since then, I must have worked with at least another 100-odd customers with content-driven apps, and the same pattern emerged. They just didn’t seem to think that content scraping was a problem, or at least not one that they could do much about.
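
To make the Insecure Direct Object Reference part of this story concrete, here is a minimal sketch in Python of the two controls that would have stopped our scraper: an unguessable booking reference and an ownership check on every read. This is not the portal's code. The names `BOOKINGS`, `create_booking` and `get_booking` are hypothetical, and the in-memory dictionary stands in for whatever database a real app would use.

```python
import secrets

# Hypothetical in-memory store: booking reference -> booking record.
BOOKINGS = {}


def create_booking(owner_id: str, details: str) -> str:
    """Store a booking under a random reference instead of a counter."""
    # A random token cannot be walked with "ID + 1" the way a counter can.
    ref = secrets.token_urlsafe(16)
    BOOKINGS[ref] = {"owner_id": owner_id, "details": details}
    return ref


def get_booking(ref: str, requester_id: str) -> dict:
    """Return a booking only when the requester is its owner."""
    record = BOOKINGS.get(ref)
    # Use the same error for "does not exist" and "not yours", so a
    # scraper cannot learn which references are valid.
    if record is None or record["owner_id"] != requester_id:
        raise PermissionError("booking not found")
    return record
```

The ownership check is the actual fix for the direct object reference. The random reference only makes enumeration harder, so I would treat it as an extra layer rather than a substitute.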

Let me offer a counterpoint. Over the years, web scraping has become much easier. With the convergence of Big Data and mature HTTP APIs (in most, if not all, programming languages), web scraping can be scaled to epic levels. Imagine scraping 3 million booking IDs into a document database and enriching the data with advanced analytics, such as demographic information, location information and so on. This technology is more accessible than ever and it’s big business in its own right. Of course, I am not implying that web scraping is bad or illegal. It is extremely useful and invaluable in certain ways. The caveat is that it has to be done legally; scraping that ignores this caveat is what we should be looking out for.

Imagine this: you run a leading niche executive recruitment app. You offer businesses and job-seekers an exclusive experience, connecting leading business executives with great companies to work at. One random member of the public (or a competitor) buys an account on your system and scrapes your content, including so-called complex JavaScript content, using a custom scraper that she stitches together in a few hours. Now all the time that you have spent building your business and database has come to nought. Yes, you may have tested your app for SQL Injection, Cross-Site Scripting, authentication flaws, crypto flaws and whatnot, but you have failed to protect a major dimension of your business. And this can be devastating.

Long story short: content scraping can be a serious problem in any content-driven app, and you need to treat it seriously, as an integral part of your security program. Let’s explore some ways to do that.

1. Consider it a Security Issue

I have never seen content scraping being thought of as a security issue. If you see it as a problem, you can solve it. In most cases, content scraping is ridiculously easy and scalable. You need to account for it right from the first stages of your development lifecycle. It has to become an essential consideration in your UI design, your backend design, your web services and more. It’s a problem. Better get with the program.

2. Know that you can do something about it

Whether it be regular markup changes, request throttling, JavaScript-driven controls, or (gasp) CAPTCHAs, there are creative and interesting ways to solve or deter the scraping problem. You can solve it if you take (1) seriously as part of your security program.
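
Of the options above, request throttling is the easiest to show in a few lines. The following is a minimal sketch in Python of a per-client sliding-window throttle, assuming you can identify a client by some key such as an account ID or an IP address. The class name `SlidingWindowThrottle` and its parameters are my own invention for illustration, and a production setup would more likely keep these counters in a shared store than in process memory.

```python
import time
from collections import defaultdict, deque


class SlidingWindowThrottle:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the throttle can be tested
        self._hits = defaultdict(deque)  # client_id -> request timestamps

    def allow(self, client_id):
        """Return True if this request may proceed, False if throttled."""
        now = self.clock()
        hits = self._hits[client_id]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A request handler would call `allow(client_id)` before serving content and respond with an error (HTTP 429 is the usual choice) when it returns `False`. Throttling will not stop a patient scraper, but it turns a job that takes hours into one that takes weeks, which is often enough of a deterrent.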

3. Include it in your Pentests

Pentests are useful for discovering all kinds of typical and business-logic-driven security vulnerabilities. If your app depends on content, then include content scraping as a key line item in your app pentests. I have seen very few companies do this in their penetration testing programs, and those that do benefit enormously from the exercise. It’s just another “out of the box” way of strengthening your app’s security.
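
As a rough idea of what such a line item could look like in practice, here is a small Python sketch of a scrape-resistance check. It walks a list of IDs in order and records whether the app ever pushes back. Everything here is an assumption for illustration: `fetch` is a hypothetical callable that you supply (it should request one item and return the HTTP status code), and treating 403 and 429 as "blocked" is my own simplification.

```python
from typing import Callable, Iterable


def scrape_resistance_check(fetch: Callable[[int], int],
                            ids: Iterable[int]) -> dict:
    """Request each ID in turn and report whether the app ever blocked us.

    fetch is a hypothetical callable: it takes an item ID and returns
    the HTTP status code of the response.
    """
    served = 0
    blocked_at = None
    for attempt, item_id in enumerate(ids, start=1):
        status = fetch(item_id)
        if status in (403, 429):  # assumed signals of throttling/blocking
            blocked_at = attempt
            break
        if status == 200:
            served += 1
    return {
        "served": served,          # items handed over before any block
        "blocked_at": blocked_at,  # attempt number of first block, or None
        # Report a finding when content was served and nothing stopped us.
        "finding": blocked_at is None and served > 0,
    }
```

Run this only against systems you are authorized to test. A result with `finding` set to `True` after a few hundred sequential requests is a good conversation starter with the development team.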

Conclusion

Scraping is not new. It’s been around since the early days of the web, which is why it’s surprising that most companies rarely, if ever, think about it as a security issue. In the spirit of “Defense-in-Depth” and good business (for content-driven apps), it should be an important factor in your app’s security. You may scrape this article freely, but with proper attribution ;)