May 16, 2019, 1:56 p.m.
There is one thing that absolutely drives me crazy in Python and that is the fact that you can access a variable that was defined outside of a function from within the function without passing it as an argument. I'm not going to lie, that does come in handy at times - especially when you are working with APIs; but it is still a terrible way to do things.
I like the way scoping is done in C++ - each variable is only valid within the block in which it is declared, but obviously that doesn't work in Python since we don't declare variables. Even so, I think that the only variables that should be available in a function are the ones which are created in it or passed to it as arguments. While we can't modify variables that were declared outside of a function in Python, only access them, if the variable is an object you can modify it using it's methods. So if we have a list and we append to it in a function using list.append() it will actually modify the list, which is crazy.
Being able to access variables that were declared outside of the function makes it easy to write bugs and makes it hard to find them. There have been several times when I have declared a variable outside of a function and then I pass it in to the function with a different name and then use the external variable instead of the one passed into the function. The code will work, but not as expected, and it is difficult to track things like this down.
This makes me see much of the merit in functional programming languages like Scala.
April 18, 2019, 7:26 a.m.
After another couple of weeks using PyTorch my initial enthusiasm has somewhat faded. I still like it a lot, but I have encountered many disadvantages. For one I can now see the advantage of TensorFlows static graphs - it makes the API easier to use. Since the graph is completely defined and then compiled you can just tell each layer how many units it should have and it will infer the number of inputs from whatever it's input is. In PyTorch you need to manually specify the inputs and outputs, which isn't a big deal, but makes it more difficult to tune networks since to change the number of units in a layer you need to change the inputs to the next layer, the batch normalization, etc. whereas with TensorFlow you can just change one number and everything is magically adjusted.
I also think that the TensorFlow API is better than PyTorch. There are some things which are very easy to do in TensorFlow which become incredibly complicated with PyTorch, like adding different regularization amounts to different layers. In TensorFlow there is a parameter to the layer that controls the regularization, in PyTorch you apparently need to loop through all of the parameters and know which ones to add what amount of regularization to.
I suppose one could easily get around these limitations with custom functions and such, and it shouldn't be surprising that TensorFlow seems more mature given that it has the weight of Google behind it, is considered the "industry standard", and has been around for longer. But I now see that TensorFlow has some advantages over PyTorch.
April 8, 2019, 2:31 p.m.
When I first started with neural networks I learned them with TensorFlow and it seemed like TensorFlow was pretty much the industry standard. I did however keep hearing about PyTorch which was supposedly better than TensorFlow in many ways, but I never really got around to learning it. Last week I had to do one of my assignments in PyTorch so I finally got around to it, and I am already impressed.
The biggest problem I always had with TensorFlow was that the graphs are static. The entire graph must be defined and compiled before it is run and it can't be altered at runtime. You feed data into the graph and it returns output. This results in the rather awkward tf.Session() which must be created before you can do anything, and which contains all of the parameters for the model.
PyTorch has dynamic graphs which are compiled at runtime. This means that you can change things as you go, including altering the graph while it is running, and you don't need to have all the dimensions of all of the data specified in advance like you do in TensorFlow. You can also do things like change the numbers of neurons in a layer dynamically and drop entire layers at runtime which you can't do with TensorFlow.
Debugging PyTorch is a lot easier since you can just make a change and test it - you don't need to recreate the graph and instantiate a session to test it out. You can just run an optimization step whenever you want. Coming from TensorFlow that is just a breath of fresh air.
TensorFlow still has many advantages, including the fact that it is still an industry standard, is easier to deploy and is better supported. But PyTorch is definitely a worth competitor, is far more flexible, and solves many of the problems with TensorFlow.
March 19, 2019, 2:20 p.m.
I was running an AWS Glue job where I was reading Parquet files from an S3 bucket. When I loaded the files individually there were no problems, but when I loaded the entire directory and tried to do any sort of transformation on the data I got this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 1.0 failed 4 times, most recent failure: Lost task 4.3 in stage 1.0 (TID 23, ip-172-31-4-9.eu-west-1.compute.internal, executor 1): org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file s3://...
I couldn't find any useful information about this online, so thought I'd post the solution here in case anyone else has the same problem.
The problem was that some of the files had columns which were entirely Null and apparently Spark doesn't like that. Reading the files individually I guess it probably read the schema from the file, but reading them as a whole apparently caused errors. I solved the problem by dropping any Null columns before writing the Parquet files.