This is an open thread; the goal is to solicit comments on what the best web scraping service might look like. Please go ahead and type away with your ideas or requirements…
The world runs on data but very few people care to follow the flow of data and explore its origins and how it ends up in the data products they consume.
The process of sausage making is somewhat similar and people don’t know what meat from what animal goes into the sausages they enjoy eating.
Data is created by data creators, or data sources: some of them humans and some of them machines. We all play a role in this data creation, whether wittingly (me writing this article) or unwittingly (using my phone or browsing a website, activities that generate a lot of data).
Data consciousness has not yet reached the level of environmental consciousness. As we start to think more about data, we will realize that exploring the origins of data, and the data itself, is critical to understanding how the future will be shaped.
Data plays an integral role in training the algorithms that increasingly shape our lives, sometimes tragically, as with the Boeing 737 Max flight control software or self-driving car systems (such as Tesla's) that did not account for data related to jaywalkers. Algorithms are no longer innocent academic exercises run in labs; they are actively affecting our lives today and will have a far greater impact in the future.
Algorithms feed on Data – it is their food, oxygen and water.
It is hence imperative that we understand the sources of data, the availability of accurate and diverse sets of data, the manipulation of this data and the end products developed using this data.
We all need to be Data Conscious and understand what's in our data sausage. The sausage is tasty while all is well, but the moment there is a problem, we need to understand its cause to figure out how to fix it.
We are all aware of the massive data collection conducted by the large tech companies. We also know of the extensive financial information that banks, credit rating agencies and data aggregators hold about us, and we recently learned (November 2019) that Google has access to a huge amount of sensitive health information.
We are usually oblivious to the pervasive nature of this data collection, and when we actually see what any of these companies have collected about us, we are shocked, angry and terrified, all at the same time.
We also know that companies “enrich” their data by combining various data sets. Once they do, the composite profile they hold about us is incredibly detailed, broken down by millisecond timestamps and precisely coded geolocation data – scary!
Yet, most of us prefer the adage “Ignorance is Bliss” because it keeps us happy, sane, worry free and reminds us that there is more good than evil in this world.
We at ScrapeHero are in the data business: we gather data from public sources for some of the largest companies in the world. Google does the same every day, at a scale and cost we cannot even fathom. However, a small percentage of our potential customers are turned off by the term “Web Scraping”. Scraping is in our name – ScrapeHero – and all over our website, and most companies have no issue with the term. This small percentage of companies would rather not associate with “Web Scraping”, yet at the same time they purchase data, at 100 or even 1,000 times the cost, from companies that do not explicitly say they gathered it through web scraping. The brand name of the data provider and the high price of the data lend it credence and value, blinding the buyers to its source. It is a nice package of sausage with a fancy set of words on the label.
The difference between food labeling and data labeling is that Food Labels must exist and be truthful, whereas there is no such thing as a Data Label.
These data provider companies do not, in most cases, create the data, so how do they get it? What is in their Data Sausage, and where did they get this data if they did not type it in themselves (the way I am typing this article and creating this data)? These companies project themselves as occupying a moral high ground, but if pressed to explain the source of their data, they are either unaware of the sources or will obfuscate them. The data is also merged with other public and private data, and it is hard to trace the resulting dataset back to its antecedents – not unlike trying to trace the meat in a sausage.
The value of data also varies: a single source of data is worth much less than a dataset that combines multiple sources into a composite profile.
We are transparent about where we get our data and what is in our Data Sausage – there is no “mystery meat” and no obfuscation. While that makes us an unpalatable data source for a small number of companies, we are highly sought after in heavily regulated industries such as finance, where the source of the data is just as important as the data itself.
Ultimately, the majority of data is gathered either from the data creators themselves or through web scraping, whether that is publicly acknowledged or not.
There is also another source of data: the surreptitious, pervasive data gathering that companies and governments engage in while providing necessary or free services.
Enjoy that sausage but do worry about what goes into it – at least once in a while.