

I Joko Dewanto
 Data Analysts are experienced data professionals in their
organization who can query and process data, provide
reports, summarize and visualize data. They have a strong
understanding of how to leverage existing tools and
methods to solve a problem, and help people from across
the company understand specific queries with ad-hoc
reports and charts.
 However, they are not expected to deal with analyzing big
data, nor are they typically expected to have the
mathematical or research background to develop new
algorithms for specific problems.
 Skills: Data Analysts need to have a baseline
understanding of some core skills: statistics, data munging,
data visualization, exploratory data analysis.
Tools: Microsoft Excel, SPSS, SPSS Modeler, SAS, SAS
Miner, SQL, Microsoft Access, Tableau, SSAS.
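As an illustration of the "summarize" part of the analyst's work, here is a minimal sketch using only the Python standard library. The sample records and the `summarize_by` helper are hypothetical and not from the slides; in practice an analyst would more likely do this in SQL, Excel or one of the tools listed above.

```python
import statistics
from collections import defaultdict

# Hypothetical sample records; real data would come from SQL or a spreadsheet.
sales = [
    {"region": "North", "amount": 120.0},
    {"region": "North", "amount": 80.0},
    {"region": "South", "amount": 200.0},
    {"region": "South", "amount": 100.0},
    {"region": "South", "amount": 150.0},
]

def summarize_by(records, key, value):
    """Group records by `key` and report count, mean and median of `value`."""
    groups = defaultdict(list)
    for row in records:
        groups[row[key]].append(row[value])
    return {
        name: {
            "count": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
        }
        for name, values in groups.items()
    }

summary = summarize_by(sales, "region", "amount")
for region, stats in sorted(summary.items()):
    print(region, stats)
```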
 Business Intelligence Developers are data experts that interact
more closely with internal stakeholders to understand the
reporting needs, and then to collect requirements, design,
and build BI and reporting solutions for the company. They
have to design, develop and support new and existing data
warehouses, ETL packages, cubes, dashboards and analytical
 Additionally, they work with databases, both relational and
multidimensional, and should have great SQL development
skills to integrate data from different resources. They use all of
these skills to meet the enterprise-wide self-service needs. BI
Developers are typically not expected to perform data
 Skills: ETL, developing reports, OLAP, cubes, web
intelligence, business objects design.
Tools: Tableau, dashboard tools, SQL, SSAS, SSIS and SPSS
 Data Engineers are the data professionals who prepare the “big
data” infrastructure to be analyzed by Data Scientists. They are
software engineers who design and build big data systems, integrate
data from various sources, and manage the result. They then write
complex queries against that data and make sure it is easily accessible
and works smoothly. Their goal is to optimize the performance of their
company’s big data ecosystem.
 They might also run some ETL (Extract, Transform and Load) on
top of big datasets and create big data warehouses that can be
used for reporting or analysis by data scientists. Beyond that,
because Data Engineers focus more on the design and
architecture, they are typically not expected to know any machine
learning or analytics for big data.
 Skills: Hadoop, MapReduce, Hive, Pig, Data streaming, NoSQL,
SQL, programming.
Tools: DashDB, MySQL, MongoDB, Cassandra
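The ETL step mentioned above can be sketched on a very small scale as follows. This is only an illustration under stated assumptions: the CSV content, the `orders` table and the `extract`, `transform` and `load` function names are hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would be a file or an upstream system.
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,not_a_number
3,carol,7.25
"""

def extract(text):
    """Extract: parse CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize names, convert amounts, skip rows that fail to parse."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would log or quarantine the bad row
        clean.append((int(row["order_id"]), row["customer"].strip().lower(), amount))
    return clean

def load(conn, rows):
    """Load: write the cleaned rows into a warehouse-style table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(conn, transform(extract(RAW_CSV)))
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)
```

Keeping the three stages as separate functions makes it easier to test each one and to swap the source or the target later.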

A data scientist is the alchemist of the 21st century: someone who can turn raw
data into purified insights. Data scientists apply statistics, machine learning
and analytic approaches to solve critical business problems. Their primary
function is to help organizations turn their volumes of big data into valuable
and actionable insights.
 Indeed, data science is not necessarily a new field per se, but it can be
considered an advanced level of data analysis that is driven and automated
by machine learning and computer science. In other words, compared with
‘data analysts’, Data Scientists are expected to have, in addition to
data analysis skills, strong programming skills, the ability to design new
algorithms and handle big data, and some expertise in the relevant domain.
 Moreover, Data Scientists are also expected to interpret and eloquently deliver
the results of their findings, by visualization techniques, building data science
apps, or narrating interesting stories about the solutions to their data
(business) problems.
 The problem-solving skills of a data scientist require an understanding of
traditional and new data analysis methods to build statistical models or
discover patterns in data. For example, creating a recommendation engine,
predicting the stock market, diagnosing patients based on their similarity, or
finding the patterns of fraudulent transactions.
 Data Scientists may sometimes be presented with big data without a
particular business problem in mind. In this case, the curious Data
Scientist is expected to explore the data, come up with the right
questions, and provide interesting findings! This is tricky because, in
order to analyze the data, a strong Data Scientist should have a very
broad knowledge of different techniques in machine learning, data
mining, statistics and big data infrastructures.
 They should have experience working with datasets of
different sizes and shapes, and be able to run their algorithms on large
datasets effectively and efficiently, which typically means staying
up-to-date with the latest cutting-edge technologies. This is why it is
essential to know computer science fundamentals and programming,
including experience with languages and database (big/small)
 Skills: Python, R, Scala, Apache Spark, Hadoop, machine learning,
deep learning, and statistics.
Tools: Data Science Experience, Jupyter, and RStudio.
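Several of the example problems mentioned earlier (recommendation engines, diagnosing patients based on their similarity) rest on some measure of similarity between records. The sketch below uses cosine similarity, which is one common choice and not necessarily what any particular system uses. The rating vectors and the `most_similar` helper are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)

def most_similar(target, candidates):
    """Return the name of the candidate whose vector is closest to `target`."""
    return max(candidates, key=lambda name: cosine_similarity(target, candidates[name]))

# Hypothetical ratings of three items by two existing users.
ratings = {"user_b": [5, 0, 4], "user_c": [0, 5, 1]}
new_user = [4, 1, 5]
print(most_similar(new_user, ratings))
```

A simple recommender could then suggest items that the most similar user rated highly and the new user has not yet seen.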
 Design, construct, install, test and maintain highly scalable data
management systems
 Ensure systems meet business requirements and industry practices
 Build high-performance algorithms, prototypes, predictive models and
proof of concepts
 Research opportunities for data acquisition and new uses for existing data
 Develop data set processes for data modeling, mining and production
 Integrate new data management technologies and software
engineering tools into existing structures
 Create custom software components (e.g. specialized UDFs) and
analytics applications
 Employ a variety of languages and tools (e.g. scripting languages) to
marry systems together
 Install and update disaster recovery procedures
 Recommend ways to improve data reliability, efficiency and quality
 Collaborate with data architects, modelers and IT team members on
project goals
 A: One of the primary facets of Urthecast’s business is to
provide our data for others to consume in a variety of ways,
whether their focus involves science, government,
education, or business. In essence, we provide data for
others to analyze and build upon. Thus, the impact of our
data engineers is extremely important. Our success lies in
both the quality and the quantity of what we can offer.
Because of this, we have data engineers working in a variety
of capacities along the entire data pipeline – from working
with the raw data, perfecting it with geospatial raster
processes (georeferencing, orthorectification, and
mosaicing), all the way to building APIs for developer
 In addition, as with most companies, we manage data
beyond our core product. For example, we analyze log files
and customer-use patterns. All companies benefit from
this knowledge, which turns into useful business metrics.
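As a rough illustration of turning log files into business metrics, the sketch below counts requests and error rates per endpoint. The log layout ("METHOD PATH STATUS"), the endpoint paths and the `endpoint_metrics` function are assumptions made for the example, not a description of any real system.

```python
from collections import Counter

# Hypothetical access-log lines in a simplified "METHOD PATH STATUS" layout.
LOG_LINES = [
    "GET /v1/scenes 200",
    "GET /v1/scenes 200",
    "GET /v1/tiles 404",
    "POST /v1/orders 201",
    "GET /v1/tiles 200",
    "malformed line",
]

def endpoint_metrics(lines):
    """Count requests and error responses (status >= 400) per endpoint."""
    hits, errors = Counter(), Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or not parts[2].isdigit():
            continue  # skip lines that do not match the assumed layout
        _method, path, status = parts
        hits[path] += 1
        if int(status) >= 400:
            errors[path] += 1
    return {
        path: {"hits": hits[path], "error_rate": errors[path] / hits[path]}
        for path in hits
    }

metrics = endpoint_metrics(LOG_LINES)
for path, stats in sorted(metrics.items()):
    print(path, stats)
```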
 A: The skills and tools that are utilized on the job are highly
dependent on which part of the data pipeline you focus on. For
myself, I’m at the tail end of the pipeline building APIs for data
consumption, integrating external datasets, and analyzing how
our data is used to further improve our end product.
 With APIs, I really feel web languages are sufficiently robust, so
it’s not as important which one you choose as long as it is
embraced as a common language amongst your team. Our
environment relies heavily on both PHP and Python. Almost all
of my code relating to data ingestion from other providers is
written in Python. It is uncomplicated and robust and can talk to
any datastore whether it’s RDBMS or NoSQL. Lastly, for data
analysis, I use Big Data technologies, such as Spark, to forecast
and recommend improvements based on how the data is
 A: We are in a data revolution, and these are exciting times. Data
used to be viewed as a simple necessity and lower on the totem
pole. Now it is more widely recognized as the source of truth. As
we move into more complex systems of data management, the
role of the data engineer becomes extremely important as a
bridge between the DBA and the data consumer. Beyond the
ubiquitous spreadsheet, graduating from RDBMS (which will
always have a place in the data stack), we now work with NoSQL
and Big Data technologies.
 As the tools and processes become more complex, and because
raw data is always dirty, data engineers will always have a place in
the workforce. I do think the tools we use will become more
refined and more powerful, but I don’t see raw data ever arriving
clean. Also, I think we will have new, as-yet-unknown to us, data
models that will keep things fresh and keep data engineers
always learning.
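To make "raw data is always dirty" concrete, here is a minimal cleaning sketch. The set of placeholder values treated as missing and the `clean_records` function are assumptions for illustration; real cleaning rules depend heavily on the dataset.

```python
def clean_records(records):
    """Trim strings, treat placeholder values as missing, drop exact duplicates."""
    missing = {"", "n/a", "null", "-"}  # assumed placeholders for "no value"
    seen = set()
    cleaned = []
    for rec in records:
        normalized = {}
        for key, value in rec.items():
            if isinstance(value, str):
                value = value.strip()
                if value.lower() in missing:
                    value = None
            normalized[key] = value
        # Records that are identical after normalization count as duplicates.
        fingerprint = tuple(sorted(normalized.items(), key=lambda kv: kv[0]))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(normalized)
    return cleaned

# Hypothetical raw input with stray whitespace, a duplicate and a placeholder.
raw = [
    {"id": 1, "city": " Vancouver "},
    {"id": 1, "city": "Vancouver"},
    {"id": 2, "city": "N/A"},
]
cleaned = clean_records(raw)
print(cleaned)
```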
 A: Data engineers are the plumbers building a data pipeline,
while data scientists are the painters and storytellers giving
meaning to an otherwise static entity. Simply put, data engineers
clean, prepare and optimize data for consumption. Once the
data becomes useful, data scientists can perform a variety of
analyses and visualization techniques to truly understand the
data, and eventually, tell a story from the data. All data has a
story to tell.
 The communication between a data engineer and a data scientist
is vital. Typically, data is not just thrown in a database awaiting
consumption. It needs to be optimized to the use case of the data
scientist. Having a clear understanding of how this handshake
occurs is important in reducing the human error component of
the data pipeline.
 Personally, I’m a fan of providing data access via an API. This
allows scientists to focus on what they can do with the data
rather than how to access the data. Not everyone understands
SQL, and not everyone writes good SQL. PDFs and spreadsheets
have their place in the board room. With a well-written RESTful
API, the data engineer is able to provide the data scientist with
either exactly what they want or the means to access raw data
and then build their final product.
 Lastly, I’ll just say that it’s important for data scientists to be
appreciative of an engineer’s work. Last year, the NY Times wrote
that 50 to 80 percent of a data scientist’s job is cleaning data.
That is not the case once you have a team of data engineers on
board, allowing the data scientist to focus on analytics.
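The idea of giving scientists data through an API rather than raw SQL access can be sketched with the Python standard library alone. Everything named here is hypothetical: the `/measurements` route, the `sensor` query parameter, the sample data and the `handle_get` function. A production API would normally use a web framework and a real datastore.

```python
import json
from http.server import BaseHTTPRequestHandler
from urllib.parse import parse_qs, urlparse

# Hypothetical in-memory dataset standing in for the engineer's datastore.
MEASUREMENTS = [
    {"id": 1, "sensor": "a", "value": 3.5},
    {"id": 2, "sensor": "b", "value": 1.0},
    {"id": 3, "sensor": "a", "value": 4.5},
]

def handle_get(url):
    """Map a request URL to (status_code, JSON-serializable body)."""
    parsed = urlparse(url)
    if parsed.path != "/measurements":
        return 404, {"error": "not found"}
    sensor = parse_qs(parsed.query).get("sensor", [None])[0]
    rows = [m for m in MEASUREMENTS if sensor is None or m["sensor"] == sensor]
    return 200, {"count": len(rows), "results": rows}

class ApiHandler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper; could be served with http.server.HTTPServer."""

    def do_GET(self):
        status, body = handle_get(self.path)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

# Exercise the routing logic directly, without starting a server.
status, body = handle_get("/measurements?sensor=a")
print(status, body["count"])
```

Keeping the routing logic in a plain function separates "what data is returned" from "how it is served", so the consumer only needs to know the URL and the parameters.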
 A: I feel a data engineer should have the following traits:
 Mechanical tendencies. A curiosity to know how things
work and how to make them better.
 Patience. Nothing will work the first time; there are just too
many moving parts.
 Humility. Data engineers are the wizards of Oz. Ultimately,
you are in a support role; you help build the underlying
infrastructure. Be proud of your work and know that others
may get more of the limelight because of your efforts.
 Focus. Designing data is one of my favorite aspects of my
work, but it tends to be a smaller percentage of my day. A
data engineer should want to be in the weeds,
understanding the intricacies of how and why a data
pipeline works as it does.
 A: Of course it’s important to be fluent in the languages
and tools that will help you get hired. But more important,
I believe, is to understand what the tools are helping you to
accomplish. Languages come and go, so it’s better to gain a
full understanding of the concepts behind building a
robust pipeline.
 Also, be extremely comfortable at the command line. Text
files still reign supreme, whether it’s your own code,
csv/json/xml data, or log files.
 Lastly, find a community and get involved! Check for something in your area or local
universities, which may have study groups that you can
join. Keep a lookout for hackathons – they always need
data specialists.
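Since text files such as csv, json and xml come up constantly, it can help to see how little code is needed to read all three into the same shape. The sample strings, the element names and the three `records_from_*` functions are hypothetical, and the sketch assumes flat records with no nesting.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

def records_from_csv(text):
    """One dict per CSV row, keyed by the header line."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]

def records_from_json(text):
    """Expects a JSON array of flat objects."""
    return json.loads(text)

def records_from_xml(text):
    """Expects a root element whose children each hold one flat record."""
    root = ET.fromstring(text)
    return [{field.tag: field.text for field in row} for row in root]

# The same hypothetical record expressed in three text formats.
CSV_TEXT = "id,name\n1,a\n"
JSON_TEXT = '[{"id": "1", "name": "a"}]'
XML_TEXT = "<rows><row><id>1</id><name>a</name></row></rows>"

for loader, text in [
    (records_from_csv, CSV_TEXT),
    (records_from_json, JSON_TEXT),
    (records_from_xml, XML_TEXT),
]:
    print(loader.__name__, loader(text))
```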
Data Engineer
 Glassdoor
Average Salary (2015): $95,936 per year
Minimum: $66,000
Maximum: $117,000
Data Scientist/Engineer
 PayScale
Median Salary (2015): $91,782 per year
Total Pay Range: $58,773 – $143,419
Senior Data Engineer
 Glassdoor
Average Salary (2015): $124,338 per year
Minimum: $105,000
Maximum: $147,000
Big Data Engineer
 Robert Half Technology 2015 Salary Guide
Salary Range (2014): $110,250 – $152,750
Salary Range (2015): $119,250 – $168,250
Salary Range (2016): $129,500 – $183,500
What Kind of Degree Will I Need?
 You will need a bachelor’s degree in computer science,
software/computer engineering, applied math,
physics, statistics or a related field and a lot of real-
world skills to qualify for most entry-level positions.
 Is a master’s required? It depends on the job. Some
employers are more than willing to accept relevant
work experience and proof of technical expertise in
lieu of a higher degree.
What Kind of Skills Will I Need?
 Technical Skills
 Statistical analysis and modeling
 Database architectures
 Hadoop-based technologies (e.g. MapReduce, Hive and Pig)
 SQL-based technologies (e.g. PostgreSQL and MySQL)
 NoSQL technologies (e.g. Cassandra and MongoDB)
 Data modeling tools (e.g. ERwin, Enterprise Architect and Visio)
 Python, C/C++, Java, Perl
 MATLAB, SAS, R
 Data warehousing solutions
 Predictive modeling, NLP and text analysis
 Machine learning
 Data mining
 UNIX, Linux, Solaris and MS Windows
Since new data management technologies are appearing every day, this list is
subject to change.
 Creative Problem-Solving: Approaching data organization
challenges with a clear eye on what is important; employing the
right approach/methods to make the maximum use of time and
human resources.
 Effective Collaboration: Carefully listening to management,
data scientists and data architects to establish their needs.
 Intellectual Curiosity: Exploring new territories and finding
creative and unusual ways to solve data management problems.
 Industry Knowledge: Understanding the way your chosen
industry functions and how data can be collected, analyzed and
utilized; maintaining flexibility in the face of big data
 If you’re interested in buffing up specific skills, you’ll find a
lot of vendor-specific certifications (e.g. Oracle, Microsoft,
IBM, Cloudera etc.). To determine which ones are worth
your investment, ask your mentors for advice, examine
recent job descriptions and browse articles like Tom’s IT
Pro “Best Of” certification lists for ideas.
 Certified Data Management Professional (CDMP)
 Developed by the Institute for the Certification of Computing
Professionals (ICCP), the CDMP is a solid, all-round
credential for general database professionals. Many
employers will recognize the acronym on your résumé.
 Data Management Association International (DAMA)
 Institute for the Certification of Computing Professionals (ICCP)
 The Data Warehousing Institute (TDWI)
