Modern business and scientific research are built on numbers. Lots of numbers. Large arrays of numbers can be processed with different tools, and NumPy is one of the most effective and easiest to learn. But what is NumPy? And how exactly does it work?
What is NumPy?
NumPy is short for Numerical Python. It’s a Python library for scientific computing that provides efficient and optimized multidimensional arrays, also called ndarrays, and various math functions for working with them.
Working with Python, you will often use libraries. In programming, a library is a set of pre-written code that makes specific tasks easier. Imagine you need to make a lasagna: you can either knead and roll out the dough yourself or use pre-made pasta. It's the same in programming: you can write your own functions from scratch, or you can use existing and proven ones.
NumPy is an essential tool for data analysis, scientific research, and machine learning, as it provides a fast, memory-efficient way to work with large datasets and perform complex mathematical operations.
How NumPy helps analysts and machine learning specialists
What NumPy does is extremely useful for dealing with business and scientific problems. Let's say you work at a store and want to know which products sell best. You have a lot of sales data that barely makes sense at first glance.
NumPy can help you by providing tools for doing calculations on large sets of data. You can use NumPy to calculate things like the total revenue from each product, the average price of each product, and the number of units sold for each product.
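As a rough sketch of these calculations, assume a small table of daily unit sales for three products. The numbers, prices, and variable names are made up for illustration:

```python
import numpy as np

# Hypothetical data: each row is a day, each column is a product (A, B, C).
units_sold = np.array([
    [10, 4, 7],
    [12, 6, 5],
    [8, 5, 9],
])
prices = np.array([2.0, 5.0, 3.0])  # assumed price per unit for A, B, C

# Number of units sold for each product (sum down each column): 30, 15, 21
units_per_product = units_sold.sum(axis=0)

# Total revenue from each product: 60.0, 75.0, 63.0
revenue_per_product = units_per_product * prices

# Average number of units sold per day for each product: 10.0, 5.0, 7.0
average_daily_units = units_sold.mean(axis=0)

# Column index of the product with the highest revenue (1, i.e. product B)
best_seller = int(np.argmax(revenue_per_product))

print(units_per_product)
print(revenue_per_product)
print(average_daily_units)
print(best_seller)
```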
By using NumPy to do these calculations, you can quickly understand which products are doing well and which ones aren't. This might help you decide whether you need to order more of some particular products.
NumPy can also be used for more complex analyses, like predicting which products are likely to sell well in the future based on past sales data. For example, if the demand for turkey grew before Thanksgiving in previous years, it will likely grow again this year. This kind of analysis may help businesses make better decisions about which products are worth investing in and how to allocate their resources.
NumPy might also come in handy with cooler things, like machine learning. Extremely useful on its own, NumPy is an integral part of many libraries working with complex calculations and machine learning models, such as Pandas, scikit-learn, TensorFlow, and many others. Artificial intelligence, which underlies many of today's technologies, such as Tesla's or Google's autopilot, is created using machine learning. In a self-driving car, a camera takes pictures of the road and its surroundings, and then a computer analyzes those pictures to figure out what's going on.
To do this, the computer uses a machine learning model that's built using ndarrays in NumPy. The ndarrays represent the pictures that the camera takes. The machine learning model needs to recognize all sorts of things that are relevant to driving, like other cars, pedestrians, stop signs, and traffic lights.
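So, for each picture, the computer breaks it down into lots of pixels. Each pixel has numbers associated with it that represent its color. The computer then organizes these numbers into an ndarray. For a color image, this is usually a three-dimensional array with one axis for the rows of pixels, one for the columns, and one for the color channels, such as red, green, and blue. It can also be reshaped into a big table where each row represents a pixel and each column represents a different color channel.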
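To make this concrete, here is a minimal sketch of a tiny image stored as an ndarray. The 2x3 size and the pixel values are invented for illustration; real camera frames are far larger:

```python
import numpy as np

# A hypothetical color image: 2 rows, 3 columns, 3 channels (red, green, blue).
# uint8 holds values from 0 to 255, the usual range for a color channel.
image = np.zeros((2, 3, 3), dtype=np.uint8)
image[0, 0] = [255, 0, 0]  # top-left pixel is pure red
image[1, 2] = [0, 0, 255]  # bottom-right pixel is pure blue

print(image.shape)  # (2, 3, 3)

# Reshape into a table with one row per pixel and one column per channel.
pixel_table = image.reshape(-1, 3)
print(pixel_table.shape)  # (6, 3)
```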
Once the computer has an ndarray for each picture, it can feed them into a neural network, which is like a brain that's designed to recognize patterns. The neural network then analyzes the ndarrays and looks for patterns that are specific to different objects, like cars or pedestrians.
With training, the neural network gradually gets better and better at recognizing these patterns, so the computer recognizes objects in the camera's pictures more accurately. And when it's time for the car to make a decision, like whether to stop at a red light or keep going, the computer uses the information from the machine learning model to decide based on what it can gather from the pictures. A neural network designed for image data in this way is called a convolutional neural network (CNN).
Here's an example of how you could use TensorFlow to build a simple CNN for recognizing patterns in images.
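The sketch below is one possible version, not a production model. It assumes TensorFlow is installed, and the input size (32x32 color images), the layer sizes, and the 10 output classes are all arbitrary choices made for illustration:

```python
import tensorflow as tf

# A small CNN: two convolution + pooling stages, then a classifier layer.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),  # assumed: 32x32 images, 3 color channels
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),  # assumed: 10 object classes
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```

To train it, you would call model.fit() with an array of images and an array of labels, both of which can be NumPy ndarrays.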
Using ndarrays
Ndarray is a special type of data structure in NumPy, which can be represented as a large table of numbers. It is similar to a spreadsheet, with rows and columns filled with numbers.
This is an example of a two-dimensional array of size 2x3, composed of 4-byte integer elements:
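A minimal sketch of such an array. The values 1 to 6 are arbitrary, and np.int32 is used to request 4-byte integers explicitly:

```python
import numpy as np

# Two rows, three columns, each element stored as a 4-byte (32-bit) integer.
arr = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)

print(arr)
print(arr.shape)     # (2, 3)
print(arr.dtype)     # int32
print(arr.itemsize)  # 4 (bytes per element)
```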
What's special about ndarrays is that they can contain many numbers arranged in a certain way, like in a grid or table. However, these numbers must all be of the same type, such as integers or floating point numbers.
One of the great things about ndarrays is that mathematical operations on them are very efficient. You can add, subtract, multiply, divide, and perform other mathematical operations on huge sets of numbers almost in the blink of an eye. This makes ndarrays really useful for data analysis, scientific computing, and machine learning.
In Python, a list is a built-in data type that might consist of various items, such as numbers, strings, or other lists.
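For instance, three lists might look like this. The names and values are made up for illustration:

```python
# A list of numbers
numbers = [1, 2, 3, 4]

# A list with items of different types
mixed = ["apple", 3.5, True]

# A list that contains other lists
nested = [[1, 2, 3], [4, 5, 6]]

print(len(nested))   # 2 (it holds two inner lists)
print(nested[1][0])  # 4 (first item of the second inner list)
```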
We can call the first two lists one-dimensional; the third is two-dimensional because it contains other lists.
Two-dimensional lists that contain lists of the same length with numbers are called matrices.
You can use a list with numbers to do simple calculations, like adding or subtracting numbers, and you can access individual items in the list by their index.
The main difference between a basic Python list and an ndarray is that ndarrays are specifically designed for doing math and working with large sets of data, and their implementation is much faster than standard Python lists. The autopilot in a self-driving car needs to process a huge amount of information every second. If you processed all this information with standard Python methods, it would take much longer and the car would not be able to drive well. NumPy handles calculations faster because many of its functions are implemented in C, a lower-level language that compiles to machine code, and because an ndarray stores elements of a single type in one contiguous block of memory.
Another difference is that ndarrays can have multiple dimensions, which means you can use them to represent more complex data structures, like matrices or images. This could be really useful for tasks like machine learning, where you might need to work with large sets of data that have multiple dimensions.
Let's get started
The first step is to make sure that you have Python installed on your computer. If you don't have it, you can download it for free from the Python website. Just follow the instructions there to download and install it.
After installing Python, you can install NumPy using the package manager, a tool that helps you download and install any libraries.
There are several different package managers which might be used, but the easiest way is to use the one called pip. Pip comes with Python, so you don't need to install it separately.
How to install NumPy using pip:
- Open a command line on your computer. On Windows, press Win + R, type cmd, and press Enter. On macOS, press Command + Space, type Terminal, and press Enter.
- Once you have the command prompt open, type in the following command:
pip install numpy
- You have installed NumPy!
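To check that the installation worked, you can start Python and import the library. This is a small sanity check, and the version number you see will depend on what pip installed:

```python
import numpy as np

# If the import succeeds, NumPy is installed; print which version you have.
print(np.__version__)
```

The alias np is a widely used convention for NumPy.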
Now you can program. To do this comfortably, you should install a development environment like Wing or PyCharm. The website of any IDE has detailed installation instructions. Also, you might want to use an online development environment.
Let's create an ndarray. There are several ways to do this.
The first way creates an ndarray
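A minimal sketch, assuming the values are passed straight to np.array():

```python
import numpy as np

# Build a one-dimensional ndarray directly from the values.
arr = np.array([1, 2, 3, 4, 5])
print(arr)
```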
If you run this code, you will get this output:
[1 2 3 4 5]
Create an ndarray from a list
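A minimal sketch, assuming two existing Python lists, one flat and one nested. The variable names are illustrative:

```python
import numpy as np

# A one-dimensional list becomes a one-dimensional ndarray.
my_list = [1, 2, 3]
arr_1d = np.array(my_list)

# A list of equal-length lists becomes a two-dimensional ndarray.
my_matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
arr_2d = np.array(my_matrix)

print(arr_1d)
print(arr_2d)
```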
After running this code, you will get this output:
[1 2 3]
[[1 2 3]
 [4 5 6]
 [7 8 9]]
But keep in mind that ndarrays are meant for numeric data, like integers or floats. np.array() will accept other values, such as strings, but the resulting array does not support arithmetic, so code that tries to do math on it will not work.
Running such code will get you a TypeError, not a SyntaxError. A SyntaxError occurs when Python sees that the code does not conform to language rules, while a TypeError means the operation is not defined for that type of data. It's like putting a glove on your foot - it's not designed for that.
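A short sketch of what happens with non-numeric data. In current NumPy versions the array of strings is created without complaint, and the failure only shows up as a TypeError once arithmetic is attempted. The example values are made up:

```python
import numpy as np

# Creating an array of strings works; NumPy stores it with a string dtype.
words = np.array(["apple", "banana"])
print(words.dtype.kind)  # U (a Unicode string array)

# Arithmetic on it is not defined, so NumPy raises a TypeError.
math_failed = False
try:
    words - 1
except TypeError as err:
    math_failed = True
    print("TypeError:", err)
```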
Operations, indexing, and slicing
Operations with ndarrays in NumPy include arithmetic operations, indexing, and slicing.
Arithmetic operations:
Addition (+)
Subtraction (-)
Multiplication (*)
Division (/)
Exponentiation (**)
Modulus (%)
Let’s look at examples.
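The sketch below applies each operator to two small arrays with made-up values. Every operation works element by element, so the first element of a is combined with the first element of b, and so on:

```python
import numpy as np

a = np.array([10, 20, 30])
b = np.array([3, 4, 5])

print(a + b)   # addition: 13, 24, 35
print(a - b)   # subtraction: 7, 16, 25
print(a * b)   # multiplication: 30, 80, 150
print(a / b)   # division: always gives floats, e.g. 20 / 4 is 5.0
print(a ** 2)  # exponentiation: 100, 400, 900
print(a % b)   # modulus (remainder): 1, 0, 0
```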
Indexing
You can get access to individual elements in an ndarray by their index (e.g., arr[0] to access the first element).
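A minimal sketch of indexing, using made-up values. Indices start at 0, negative indices count from the end, and a two-dimensional array takes a row index and a column index:

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])
print(arr[0])   # 10 (first element)
print(arr[-1])  # 50 (last element)

grid = np.array([[1, 2, 3], [4, 5, 6]])
print(grid[1, 2])  # 6 (second row, third column)
```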
Slicing
You can access subsets of an ndarray by ranges of indices (e.g., arr[0:3] to access the first three elements).
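A minimal sketch of slicing, again with made-up values. A slice start:stop includes the start index and excludes the stop index:

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])
print(arr[0:3])  # [10 20 30] (first three elements)
print(arr[2:])   # [30 40 50] (from index 2 to the end)
print(arr[::2])  # [10 30 50] (every second element)

grid = np.array([[1, 2, 3], [4, 5, 6]])
first_column = grid[:, 0]  # all rows, first column: 1 and 4
print(first_column)
```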
Basic Python vs. NumPy
Is it possible to work with arrays of numbers without a special library? Yes, but it is not very handy. Say you have the problem of multiplying each element of an array by three. Let's compare how you can solve it with and without NumPy.
Basic Python way
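A sketch using a plain list and a loop. The input values are inferred from the result quoted in this section, not taken from the original snippet:

```python
numbers = [16, 35, 9, 56, 88, 111]

# Visit every element and multiply it by three, one at a time.
tripled = []
for n in numbers:
    tripled.append(n * 3)

print(tripled)  # [48, 105, 27, 168, 264, 333]
```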
NumPy way
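A sketch of the same task with an ndarray, using the same inferred input values. No loop is needed because the multiplication is applied to every element at once:

```python
import numpy as np

numbers = np.array([16, 35, 9, 56, 88, 111])

# One expression multiplies every element by three.
tripled = numbers * 3

print(tripled)
```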
In each case, you will get the same values, 48, 105, 27, 168, 264, and 333, but NumPy makes it easier and faster.
Speed is crucial if you happen to be processing hundreds of thousands of numbers, which is a common task when working with big data.
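One rough way to see the difference yourself is to time both approaches with Python's built-in timeit module. The array size and repeat count below are arbitrary, and the exact timings will vary from machine to machine:

```python
import timeit

import numpy as np

size = 100_000
py_list = list(range(size))
np_array = np.arange(size)

# Time 20 runs of each approach.
list_time = timeit.timeit(lambda: [n * 3 for n in py_list], number=20)
numpy_time = timeit.timeit(lambda: np_array * 3, number=20)

print(f"Python list: {list_time:.4f} s")
print(f"NumPy array: {numpy_time:.4f} s")
```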
Other array-operating libraries
Pandas
Pandas is a library in Python used for data manipulation and analysis. It's great for working with data tables, like spreadsheets or databases. Pandas provides two main data structures, Series and DataFrame, which are more powerful than Python's built-in lists.
Pros:
- Pandas has more powerful tools for data manipulation and analysis, like the ability to group data by certain criteria or join different data sets together.
- Pandas is easier to use for working with labeled data, where each column has a name and each row has an index.
- Pandas can handle missing data more easily than NumPy.
Cons:
- Pandas might be slower than NumPy for some operations, especially on large datasets.
- Pandas has more overhead and can be more memory-intensive than NumPy.
In general, you should use Pandas when you're working with tabular data and need to do things like filtering, sorting, or grouping. If you're working with numerical data requiring faster processing, NumPy might be a better choice.
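As a small illustration of labeled data, grouping, and missing values, here is a sketch that assumes Pandas is installed. The table contents are made up:

```python
import numpy as np
import pandas as pd

# A hypothetical sales table with named columns and one missing value.
sales = pd.DataFrame({
    "product": ["apples", "pears", "apples", "pears"],
    "units": [10, 4, np.nan, 6],
})

# Group rows by product and add up the units; the missing value is skipped.
total_units = sales.groupby("product")["units"].sum()
print(total_units)

# Filter rows by a condition on a named column.
big_sales = sales[sales["units"] > 5]
print(big_sales)
```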
SciPy
SciPy is a Python library used for scientific and technical computing. It is built on top of NumPy and provides additional functionality for tasks such as optimization, integration, interpolation, and more.
Pros:
- SciPy provides a wide range of scientific and mathematical functions not available in NumPy, such as numerical integration, signal processing, and optimization algorithms.
- SciPy has additional data structures and tools that can be useful for scientific calculations, such as sparse matrices and statistical distributions.
- SciPy is built on top of NumPy, so it is easy to use with NumPy for numerical calculations.
Cons:
- Some of SciPy's more advanced functions might be more difficult to use and require a deeper understanding of the underlying mathematics.
- Some of SciPy's more advanced functions may be slower than the simpler NumPy functions, especially for large data sets.
In general, it’s better to use SciPy when you are working with more complex mathematical or scientific problems that require specialized functions, such as optimization or numerical integration. If you're just working with basic numerical calculations, NumPy will probably be enough.
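As a small illustration of numerical integration, here is a sketch that assumes SciPy is installed. The function being integrated is an arbitrary example:

```python
from scipy import integrate


def f(x):
    """An example function to integrate: x squared."""
    return x ** 2


# Integrate f from 0 to 3. The exact answer is 3**3 / 3 = 9.
result, error_estimate = integrate.quad(f, 0, 3)

print(result)          # approximately 9.0
print(error_estimate)  # an estimate of the numerical error
```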
Although these libraries solve different specialized problems, they are largely based on NumPy methods. So it's a good idea to study NumPy before mastering these libraries - that way you'll get a deeper understanding of how they work.
In short
It's not scary to work with numbers using the full power of NumPy. It's a great tool for data processing and getting started in data science. To become a master of numbers, it’s worth learning a few more libraries and basic Python. TripleTen’s Data Science Bootcamp teaches Python, NumPy, and a few more libraries as a part of its skill set. If you’re ready to break into IT, see how TripleTen can assist you today.